Detection of somatic mutations: A data mining and a computational approach
Presenter: Huy Vuong, PhD, Department of Biomedical Informatics, Vanderbilt University, 5/3/2013
Somatic single nucleotide variants (sSNV)
• Play a major role in tumorigenesis and cancer development
• Aim 1: Literature mining
  • Catalogue of Somatic Mutations In Cancer (COSMIC): the most comprehensive catalogue today
• Aim 2: Tumor-specific mutations in tumor-normal pairs
Mutations in COSMIC
• V1 (2004): 10,647
• V60 (7/2012): 340,585
• V61 (9/2012): 405,271
• V62 (11/2012): 745,924
Classes of somatic mutations
• Point mutation:
  • Coding
    • Silent
    • Missense
    • Nonsense
  • Noncoding (UTR, ncRNA, miRNA…)
    • Intronic
    • Intergenic
• Small scale mutation:
  • Small insertions
  • Small deletions
• Large scale mutation: rearrangements
  • Intrachromosomal
    • Deletion
    • Inversion
    • Duplication
  • Interchromosomal
    • Translocation
    • Insertion
Aim 1: Mining COSMIC For Protein Domain Interaction
History of COSMIC
The Evolution of the Cosmos started with the Big Bang! (http://en.wikipedia.org/wiki/Big_Bang)
Yet another COSMIC
• History of the Catalogue Of Somatic Mutations In Cancer (Wellcome Trust Sanger Institute)
• COSMIC V1 (4th February, 2004)
• COSMIC V64 (26th March, 2013)
Comparison V1 vs. V64
Release | Genes | Mutations | Tumours
V1 (2004) | 4 | 10,647 | 57,444
V64 (2013) | 24,394 | 913,166 | 847,698
Advantages and Disadvantages
Advantages:
• Bimonthly updates
• Manually curated data, with low-quality data removed
• Consistent vocabulary (histology and tissue)
• Each mutation maps to a single version of a gene (no alternative splicing)
• FREE availability!!!
Disadvantages:
• Curation bias
• Many positive results, few negative results
• Other quality issues: experimental error, missing mutations
• Interpretation of mutation frequency
Typical workflow
[Figure: workflow diagram; recoverable labels: Histogram, Distribution]
Specific aims
• Map somatic mutations (SM) in COSMIC to protein structural models
• Identify SM in the pocket regions of proteins
• Use statistical analysis to score SM in the context of cancer (specificity, sensitivity)
Dataset and preprocessing step
• Data are downloaded from COSMIC version 62 via the BioMart interface as a TSV file (http://cancer.sanger.ac.uk/biomart/martview/)
• Use R to clean the data (i.e., remove duplicates) and import it into a SQLite database
• The database contained 776,917 mutations and 15 variables:
  1. Gene.Name
  2. CDS.Mutation.Syntax
  3. AA.Mutation.Syntax
  4. Zygosity
  5. Primary.Site
  6. Primary.Histology
  7. In.Cancer.Census
  8. Tumour.Source
  9. Genomic.Coordinates.GRCh37
  10. CDS.Mutation.Type
  11. AA.Mutation.Type
  12. Somatic.status
  13. Validation.status
  14. Entrez.Gene.ID
  15. COSMIC.Sample.ID
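The preprocessing step above was done in R. As a rough illustration only, the same idea (read the TSV export, drop exact duplicate rows, load the result into SQLite) can be sketched in Python with the standard library. The function name, the table name, and the three-column sample rows are assumptions made for the example, not part of the original workflow.

```python
import csv
import io
import sqlite3

def load_mutations(tsv_file, conn, table="mutations"):
    """Read a tab-separated export, drop exact duplicate rows, and store it in SQLite."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    seen = set()
    rows = []
    for row in reader:
        key = tuple(row)
        if len(row) != len(header) or key in seen:
            continue  # skip malformed lines and exact duplicates
        seen.add(key)
        rows.append(key)
    # Column names such as "Gene.Name" contain dots, so they are double-quoted.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)

# Illustrative sample with three of the 15 columns; the second KRAS row is a duplicate.
SAMPLE = (
    "Gene.Name\tAA.Mutation.Syntax\tPrimary.Site\n"
    "KRAS\tp.G12D\tlung\n"
    "KRAS\tp.G12D\tlung\n"
    "BRAF\tp.V600E\tskin\n"
)

conn = sqlite3.connect(":memory:")
n_loaded = load_mutations(io.StringIO(SAMPLE), conn)
per_gene = conn.execute(
    'SELECT "Gene.Name", COUNT(*) FROM mutations GROUP BY "Gene.Name" ORDER BY 1'
).fetchall()
print(n_loaded, per_gene)
```

For a real export, the file would be opened with `open(path, newline="")` and the connection pointed at a database file instead of `:memory:`.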
Protein pocket region
• Li et al. developed an algorithm to identify functional pocket regions in proteins
• The vast majority of disease-associated SNPs are located in pockets (Tseng and Li, PNAS, 2011)
A case study: KRAS
About 64% of SM in KRAS are located in the functional pocket region
Yu et al. (Nature Biotechnology, 2012) also reported that about 65% of disease-associated in-frame mutations are located on the interaction surfaces of the proteins associated with the diseases.
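A percentage like the one quoted for KRAS can be computed by checking each mutation's residue position against a set of pocket residues. The sketch below is one possible way to do that, assuming simple substitutions written like `p.G12D` in AA.Mutation.Syntax. The pocket residue set and the mutation list are hypothetical values chosen for illustration; they are not the pocket definition or the data used in the talk.

```python
import re

# Matches simple substitutions such as "p.G12D": reference residue, position, new residue.
AA_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")

def residue_position(aa_syntax):
    """Return the residue number of a simple substitution, or None for other syntaxes."""
    match = AA_SUBSTITUTION.match(aa_syntax)
    return int(match.group(2)) if match else None

def pocket_fraction(aa_mutations, pocket_residues):
    """Fraction of parseable substitutions whose position lies in the pocket set."""
    positions = [residue_position(m) for m in aa_mutations]
    positions = [p for p in positions if p is not None]
    if not positions:
        return 0.0
    in_pocket = sum(1 for p in positions if p in pocket_residues)
    return in_pocket / len(positions)

# Hypothetical inputs for illustration only.
example_mutations = ["p.G12D", "p.G12V", "p.G13D", "p.Q61H", "p.A146T", "p.?"]
example_pocket = {12, 13, 61}

fraction = pocket_fraction(example_mutations, example_pocket)
print(f"{fraction:.0%} of parsed substitutions fall in the example pocket")
```

Entries that are not simple substitutions (here `p.?`) are skipped and do not count toward the denominator.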
Aim 2: Tumor-specific mutations in tumor-normal pairs
Outline
• Challenges in detecting somatic single nucleotide variants (sSNV)
• GATK pipeline for calling sSNV
• Installing and running MuTect
• MuTect output
• Summary
Detecting sSNV in cancer: challenge #1
Many sSNV occur at very low frequency in the genome (0.1 to 100 mutations per megabase)
Slide adapted from Mike Lawrence, TCGA Annual Symposium
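To put the rate range in challenge #1 into absolute terms, the rate can be multiplied by the size of the sequenced territory. The territory sizes below (about 3,000 Mb for a whole genome, about 30 Mb for an exome) are round-number assumptions for illustration and do not come from the slides.

```python
def expected_mutation_count(rate_per_mb, territory_mb):
    """Expected number of somatic mutations for a rate (per megabase) over a territory (in Mb)."""
    return rate_per_mb * territory_mb

# Assumed territory sizes, not taken from the slides.
GENOME_MB = 3000
EXOME_MB = 30

for rate in (0.1, 1, 100):
    genome = expected_mutation_count(rate, GENOME_MB)
    exome = expected_mutation_count(rate, EXOME_MB)
    print(f"{rate} per Mb -> about {genome:,.0f} genome-wide, {exome:,.0f} in the exome")
```

At the low end of the range only a handful of true variants are expected in an exome, which is why even a small false-positive rate per base matters.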
Detecting sSNV in cancer: challenge #2
Tumors are impure (i.e., contain contaminating normal cells) and heterogeneous (i.e., contain sub-clones)
[Figure: panel C, tri-clonal tumor]
Slide adapted from Christopher Miller, TCGA Annual Symposium, and Elaine Mardis
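Impurity and sub-clonality both lower the fraction of reads that carry a variant. A minimal sketch of that arithmetic follows, under simplifying assumptions that are not stated on the slide: normal cells are diploid, every tumor cell has the same copy number at the locus, and reads sample alleles in proportion to their abundance. The function and parameter names are hypothetical.

```python
def expected_allelic_fraction(purity, clonal_fraction, mutant_copies=1, tumor_copies=2):
    """Expected fraction of reads carrying a somatic variant.

    purity: fraction of cells in the sample that are tumor cells (0..1).
    clonal_fraction: fraction of tumor cells that carry the variant (0..1).
    mutant_copies / tumor_copies: mutant-allele copies in a mutated tumor cell and
    total copies of the locus in any tumor cell. Normal cells are assumed diploid.
    """
    if not 0 <= purity <= 1 or not 0 <= clonal_fraction <= 1:
        raise ValueError("purity and clonal_fraction must lie in [0, 1]")
    mutant_alleles = purity * clonal_fraction * mutant_copies
    total_alleles = purity * tumor_copies + (1 - purity) * 2
    return mutant_alleles / total_alleles

# A clonal heterozygous variant in a pure tumor versus a sub-clonal one in an impure tumor.
for purity, clonal in [(1.0, 1.0), (0.6, 0.5), (0.4, 0.5)]:
    f = expected_allelic_fraction(purity, clonal)
    print(f"purity={purity}, clonal fraction={clonal} -> allelic fraction {f:.2f}")
```

Under these assumptions a heterozygous variant in half the cells of a 40% pure tumor is expected at an allelic fraction of about 0.1.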
GATK pipeline
GATK Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
NGS: Resources
• SEQanswers (http://seqanswers.com/)
• SEQanswers software list (http://seqanswers.com/wiki/Software/list)
• Galaxy (https://main.g2.bx.psu.edu/)
• NGS Catalog (http://bioinfo.mc.vanderbilt.edu/NGS/)
Slide adapted from Peilin Jia, PhD
Two types of error
• USER ERRORS:
  • Due to wrong command line or incorrect user input files
  • Please do not post this error to the GATK forum
• RUNTIME ERRORS:
  • Due to the program code
  • Do post this error to the GATK forum (together with the trace file)
USER ERROR
• ##### ERROR ------------------------------------------------------------------------------------------
• ##### ERROR A USER ERROR has occurred (version 2.2-25-g2a68eab):
• ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
• ##### ERROR Please do not post this error to the GATK forum
• ##### ERROR
• ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
• ##### ERROR Visit our website and forum for extensive documentation and answers to
• ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
• ##### ERROR
• ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/vuongh/Lungevity_Project/GATK/bwa/13_karosorted_RG_MarkDup_Realigned_Recal.bam} is malformed: read starts with deletion. Cigar: 9D18M15I38M26S. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar
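The message above complains that a read's CIGAR starts with a deletion. The sketch below shows one way to spot such CIGAR strings before running the pipeline. It is a simplified check written for illustration, not GATK's BadCigar filter; in particular, skipping leading soft and hard clips is an assumption about what "starts with" should mean.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs; raise on unparseable input."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    return ops

def starts_with_deletion(cigar):
    """True if the first operation, ignoring leading clips, is a deletion."""
    for _, op in parse_cigar(cigar):
        if op in "SH":
            continue  # assumption: clipping before the first aligned element is ignored
        return op == "D"
    return False

for cigar in ["9D18M15I38M26S", "76M", "5S9D10M"]:
    print(cigar, starts_with_deletion(cigar))
```

The first string is the CIGAR quoted in the error message; the other two are made-up examples.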
BEST OF RUNTIME ERROR
• ##### ERROR ------------------------------------------------------------------------------------------
• ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7-g5e89f01):
• ##### ERROR
• ##### ERROR Please visit the wiki to see if this is a known problem
• ##### ERROR If not, please post the error, with stack trace, to the GATK forum
• ##### ERROR Visit our website and forum for extensive documentation and answers to
• ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
• ##### ERROR
• ##### ERROR MESSAGE: START (0) > (-1) STOP -- this should never happen -- call Mauricio!
MuTect: a highly sensitive and specific sSNV caller
• Distinct features
  • Focuses on identifying low allelic fraction mutations that arise from tumor heterogeneity, contaminating normal cells, and sub-clones
  • Uses a Bayesian model with the allelic fraction as a parameter, which yields high sensitivity
  • A carefully tuned, elaborate set of filters yields high specificity
Overview of the detection of a somatic point mutation using MuTect
[Figure: MuTect workflow; recoverable labels: Bayesian model, Variant Filter, Panel of Normal Filter]
Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514
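The Bayesian model step compares a "variant at allelic fraction f" model with a "reference only" model for the tumor reads. The following is a simplified sketch of that log-odds (LOD) score, assuming the per-base likelihood described in the cited paper and one error probability per base. It is not MuTect's code: the function names are hypothetical, f is simply estimated as the observed alternate fraction, and the matched-normal test and all filters are left out. The value 6.3 is the cutoff threshold quoted on the density-plot slide of this talk.

```python
import math

def base_likelihood(base, error, ref, alt, f):
    """P(observed base | true allelic fraction f), given the base's error probability."""
    if base == ref:
        return f * error / 3 + (1 - f) * (1 - error)
    if base == alt:
        return f * (1 - error) + (1 - f) * error / 3
    return error / 3

def tumor_lod(bases, errors, ref, alt):
    """Return (LOD, estimated f): log10 likelihood ratio of the variant model vs. f = 0."""
    f_hat = sum(1 for b in bases if b == alt) / len(bases)
    log_variant = sum(
        math.log10(base_likelihood(b, e, ref, alt, f_hat)) for b, e in zip(bases, errors)
    )
    log_reference = sum(
        math.log10(base_likelihood(b, e, ref, alt, 0.0)) for b, e in zip(bases, errors)
    )
    return log_variant - log_reference, f_hat

THRESHOLD = 6.3  # cutoff quoted in the talk

# Made-up pileups at 30x depth with base error 0.001 (roughly Q30): 1 vs. 4 alternate reads.
for n_alt in (1, 4):
    bases = ["G"] * (30 - n_alt) + ["A"] * n_alt
    lod, f_hat = tumor_lod(bases, [0.001] * 30, ref="G", alt="A")
    print(f"{n_alt} alt reads: f={f_hat:.3f}, LOD={lod:.2f}, pass={lod >= THRESHOLD}")
```

Error probabilities must be greater than zero here, otherwise the logarithm of a zero likelihood is undefined.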
Benchmarking mutation-detection methods
Advantages:
• High sensitivity at low allelic fraction (f=0.1)
• High specificity achieved by filters
Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514
Filter options
• Proximal gap
• Poor mapping
• Triallelic site
• Strand bias
• Clustered position
• Observed in Control
• Panel of normal samples
[Figure: strand bias, good vs. bad examples; Jia et al. PLoS ONE 7(6): e38470]
Installing MuTect
• Installation (Linux)
  • Version 1.1.4 available for download at http://www.broadinstitute.org/cancer/cga/mutect_download (must register an account at Broad)
  • Can also be built from source, available for download at http://www.nature.com/nbt/journal/v31/n3/extref/nbt.2514-S3.zip
Preparing input
• Resources:
  • COSMIC VCF file: use b37_cosmic_v54_120711.vcf
  • dbSNP VCF file: use dbsnp_132_b37.leftAligned.vcf.gz
  • Human reference fasta: downloaded from the GATK reference bundle; use the Homo_sapiens_assembly19.fasta, *.fai, and *.dict files
• Inputs:
  • Tumor BAM file and matched normal BAM file from the output of a read alignment tool (e.g. BWA, TopHat)
  • BAM files need to be sorted and indexed
  • Recommendation: apply local realignment around indels and mark PCR duplicates, following the GATK best practices for variant detection
Running MuTect
• Command line with all default parameters
java -Xmx4g -jar /scratch/vuongh/mutect_latest/muTect-1.1.4.jar \
  --analysis_type MuTect \
  --reference_sequence /ref/Homo_sapiens_assembly19.fasta \
  -cosmic /ref/hg19_cosmic_v54_120711.vcf \
  -dbsnp /ref/dbsnp_132_b37.leftAligned.vcf \
  --input_file:normal /Huy-RNAseq/1/accepted_hits.sorted.RG.bam \
  --input_file:tumor /Huy-RNAseq/2/accepted_hits.sorted.RG.bam \
  --out /out/1_2_cal_stats.out \
  --vcf /out/1_2_mutation.vcf \
  -cov /out/1_2_coverage.wig.txt \
  --enable_extended_output
Notes:
• Put all resource files (COSMIC, dbSNP and reference fasta) in folder ref
• Normal BAM file and index in folder 1, tumor BAM file and index in folder 2
• Output call stats and VCF file of mutation candidates in folder out
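When many tumor-normal pairs have to be processed, the command line above can be assembled programmatically. The sketch below only rebuilds the argument list shown on the slide, using the same flags and paths; the helper name and its parameters are hypothetical, and nothing is executed.

```python
import shlex

def build_mutect_command(jar, reference, cosmic, dbsnp, normal_bam, tumor_bam,
                         out_prefix, java_heap="4g"):
    """Assemble the MuTect argument list used on the slide for one tumor-normal pair."""
    return [
        "java", f"-Xmx{java_heap}", "-jar", jar,
        "--analysis_type", "MuTect",
        "--reference_sequence", reference,
        "-cosmic", cosmic,
        "-dbsnp", dbsnp,
        "--input_file:normal", normal_bam,
        "--input_file:tumor", tumor_bam,
        "--out", f"{out_prefix}_cal_stats.out",
        "--vcf", f"{out_prefix}_mutation.vcf",
        "-cov", f"{out_prefix}_coverage.wig.txt",
        "--enable_extended_output",
    ]

command = build_mutect_command(
    jar="/scratch/vuongh/mutect_latest/muTect-1.1.4.jar",
    reference="/ref/Homo_sapiens_assembly19.fasta",
    cosmic="/ref/hg19_cosmic_v54_120711.vcf",
    dbsnp="/ref/dbsnp_132_b37.leftAligned.vcf",
    normal_bam="/Huy-RNAseq/1/accepted_hits.sorted.RG.bam",
    tumor_bam="/Huy-RNAseq/2/accepted_hits.sorted.RG.bam",
    out_prefix="/out/1_2",
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the job; keeping the arguments as a list avoids shell quoting problems with unusual paths.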
Result
• Test data: RNA-seq data from squamous cell lung cancer patients (tumor/normal pair)
• Total run time: 6 hours on 8 Intel Nehalem CPUs (2.4 GHz), processing 65.1 million reads per sample
• View the result with Excel
Example of MuTect output
contig | position | ref_allele | alt_allele | t_lod_fstar | tumor_f | contaminant_lod | failure_reasons | judgement
1 | 14470 | G | A | 8.631487 | 0.272727 | -0.096458 | normal_lod,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq | REJECT
1 | 14542 | A | G | 4.993144 | 0.076923 | -0.228097 | fstar_tumor_lod,possible_contamination,normal_lod,alt_allele_in_normal | REJECT
1 | 14574 | A | G | 4.82618 | 0.071429 | -0.245647 | fstar_tumor_lod,possible_contamination | REJECT
1 | 14653 | C | T | 137.966026 | 0.714286 | -0.429894 | normal_lod,alt_allele_in_normal | REJECT
1 | 14673 | G | C | 5.07638 | 0.030769 | 2.317242 | fstar_tumor_lod,possible_contamination,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq | REJECT
1 | 139393 | G | T | 8.97833 | 0.3 | -0.087734 | | KEEP
1 | 788867 | C | T | 7.335518 | 0.285714 | -0.061414 | | KEEP
1 | 1321326 | C | G | 7.495658 | 0.333333 | -0.052641 | | KEEP
1 | 1498692 | T | C | 6.681093 | 0.2 | -0.087736 | | KEEP
1 | 1498813 | T | C | 6.706235 | 0.166667 | -0.105281 | | KEEP
Keep: 1143 (0.5%); Reject: 213000 (99.5%)
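Keep and reject totals like these can be tallied directly from the call stats file instead of in Excel. The sketch below assumes a tab-delimited file with a header row containing the `judgement` and `failure_reasons` columns shown on the slide, and it skips lines starting with `#`. The three sample rows are copied from the slide; a real file has many more columns and rows.

```python
import csv
import io
from collections import Counter

def summarize_call_stats(handle):
    """Count KEEP/REJECT judgements and individual failure reasons in a call stats table."""
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    judgements = Counter()
    reasons = Counter()
    for row in reader:
        judgements[row["judgement"]] += 1
        # failure_reasons is a comma-separated list and is empty for KEEP calls.
        for reason in filter(None, row["failure_reasons"].split(",")):
            reasons[reason] += 1
    return judgements, reasons

HEADER = ["contig", "position", "ref_allele", "alt_allele", "t_lod_fstar",
          "tumor_f", "contaminant_lod", "failure_reasons", "judgement"]
ROWS = [
    ["1", "14470", "G", "A", "8.631487", "0.272727", "-0.096458",
     "normal_lod,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq", "REJECT"],
    ["1", "14574", "A", "G", "4.82618", "0.071429", "-0.245647",
     "fstar_tumor_lod,possible_contamination", "REJECT"],
    ["1", "139393", "G", "T", "8.97833", "0.3", "-0.087734", "", "KEEP"],
]
sample = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"

judgements, reasons = summarize_call_stats(io.StringIO(sample))
total = sum(judgements.values())
for label, count in sorted(judgements.items()):
    print(f"{label}: {count} ({count / total:.1%})")
```

The `reasons` counter shows which filters remove the most candidates, which helps when deciding whether a filter is too strict for a given data type.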
Distribution of keep versus reject calls
• Most reject calls are high allelic fraction sSNV
• Most of the low allelic fraction sSNV are kept
• Mono-clonal ???
[Figure: density plot with cutoff threshold = 6.3; x-axis: allelic fraction f; y-axis: density]
Variant annotation (Annovar)
Display 10 out of 432 genes
Effect | Variant annotation | Chr | Start | End | Ref | Alt
nonsynonymous SNV | CLSTN1:NM_014944:exon2:c.C163T:p.L55F,CLSTN1:NM_001009566:exon2:c.C163T:p.L55F, | 1 | 9833381 | 9833381 | G | A
stopgain SNV | MASP2:NM_006610:exon10:c.T1236A:p.C412X, | 1 | 11090294 | 11090294 | A | T
nonsynonymous SNV | VPS13D:NM_018156:exon63:c.G11985C:p.L3995F,VPS13D:NM_015378:exon64:c.G12060C:p.L4020F, | 1 | 12475169 | 12475169 | G | C
nonsynonymous SNV | DHRS3:NM_004753:exon6:c.G852C:p.E284D, | 1 | 12628426 | 12628426 | C | G
nonsynonymous SNV | RSC1A1:NM_006511:exon1:c.C1741T:p.L581F, | 1 | 15988104 | 15988104 | C | T
nonsynonymous SNV | RAP1GAP:NM_001145657:exon9:c.T297A:p.H99Q,RAP1GAP:NM_001145658:exon8:c.T489A:p.H163Q,RAP1GAP:NM_002885:exon8:c.T297A:p.H99Q, | 1 | 21940577 | 21940577 | A | T
stopgain SNV | HSPG2:NM_005529:exon41:c.C5053T:p.R1685X, | 1 | 22186457 | 22186457 | G | A
nonsynonymous SNV | RPL11:NM_000975:exon2:c.C7G:p.Q3E, | 1 | 24019099 | 24019099 | C | G
nonsynonymous SNV | RPL11:NM_000975:exon2:c.A8C:p.Q3P, | 1 | 24019100 | 24019100 | A | C
synonymous SNV | RPS6KA1:NM_002953:exon22:c.G2207A:p.X736X,RPS6KA1:NM_001006665:exon21:c.G2234A:p.X745X, | 1 | 26900691 | 26900691 | G | A
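Each variant annotation field packs one entry per transcript in the form gene:transcript:exon:cDNA change:protein change, separated by commas. A small parser such as the one sketched below makes those pieces easy to filter or count. The class and function names are hypothetical, and the sketch only handles the five-part entries seen in this table.

```python
from typing import NamedTuple

class TranscriptChange(NamedTuple):
    gene: str
    transcript: str
    exon: str
    cds_change: str
    protein_change: str

def parse_annotation(field):
    """Split a comma-separated annotation field into one record per transcript."""
    changes = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the field ends with a trailing comma
        parts = entry.split(":")
        if len(parts) != 5:
            raise ValueError(f"unexpected annotation entry: {entry!r}")
        changes.append(TranscriptChange(*parts))
    return changes

# Annotation string copied from the first row of the table.
field = "CLSTN1:NM_014944:exon2:c.C163T:p.L55F,CLSTN1:NM_001009566:exon2:c.C163T:p.L55F,"
changes = parse_annotation(field)
for change in changes:
    print(change.gene, change.transcript, change.protein_change)
```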
Summary
• MuTect is a highly sensitive and specific tool for calling somatic SNVs
• Designed to detect low allelic fraction somatic mutations present in as few as 10% of cancer cells
• Easy to install and run on all OS
• Works on all NGS data
• Limitations:
  • Computationally intensive
  • Can’t call indels
THANK YOU