Review Article
A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
Jennifer D. Hintzsche,1 William A. Robinson,1,2 and Aik Choon Tan1,2,3
1Division of Medical Oncology, Department of Medicine, School of Medicine, Aurora, CO 80045, USA
2University of Colorado Cancer Center, Aurora, CO 80045, USA
3Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
Correspondence should be addressed to Aik Choon Tan; aikchoon.tan@ucdenver.edu
Received 25 May 2016; Accepted 26 October 2016
Academic Editor: Lam C. Tsoi
Copyright © 2016 Jennifer D. Hintzsche et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shifted the bottleneck from experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. These tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make more informed decisions on selecting the best tools to analyze their WES data.
1. Introduction
Recent advances in next-generation sequencing technologies provide revolutionary opportunities to characterize the genomic landscapes of individuals at single base resolution for identifying actionable mutations for disease treatment and management [1, 2]. Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome, that is, all coding regions of known genes in a genome. For example, more than 85% of disease-causing mutations in Mendelian diseases are found in the exome, and WES provides an unbiased approach to detect these variants in the era of personalized and precision medicine. Next-generation sequencing technologies have shifted the bottleneck from experimental data production to computationally intensive informatics-based data analysis. For example, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. To gain insights from WES, novel computational algorithms and bioinformatics methods represent a critical component in modern biomedical research to analyze and interpret these massive datasets.
Genomic studies that employ WES have increased over the years, and new bioinformatics methods and computational tools have been developed to assist the analysis and interpretation of this data (Figure 1). The majority of WES computational tools are centered on the generation of a Variant Calling Format (VCF) file from raw sequencing data. Once the VCF files have been generated, further downstream analyses can be performed by other computational methods. Therefore, in this review, we have classified bioinformatics methods and computational tools into Pre-VCF and Post-VCF categories. Pre-VCF workflows include tools for aligning the raw sequencing reads to a reference genome, variant detection, and annotation. Post-VCF workflows include methods for somatic mutation detection, pathway analysis, copy number alterations, INDEL identification, and driver prediction. Depending on the nature of the hypothesis, beyond VCF analysis can also include methods that link variants to clinical data as well as potential therapeutics (Figure 2).
Hindawi Publishing Corporation, International Journal of Genomics, Volume 2016, Article ID 7983236, 16 pages, http://dx.doi.org/10.1155/2016/7983236
[Figure 1: bar chart. Y-axis: number of publications in PubMed (0 to 1400). X-axis: year. Series: Whole Exome Sequencing (WES) Studies; WES Computational Tools.]
Figure 1: Trends in Whole Exome Sequencing studies and tools by querying PubMed (2011–2016).
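Because the VCF file is the pivot between the Pre-VCF and Post-VCF stages, it may help to see what a single record holds. The following is a minimal sketch, not taken from any tool reviewed here: it splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name parse_vcf_line and the example record are hypothetical and chosen only for illustration.

```python
def parse_vcf_line(line):
    """Split one VCF data line into a dict of its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    return {
        "chrom": chrom,
        "pos": int(pos),                  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),            # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Hypothetical record, for illustration only.
example = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=100;AF=0.5;DB"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```

Real pipelines would normally rely on an established VCF library rather than hand parsing; the sketch only shows which fields the downstream (Post-VCF) tools consume.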
Computational tools that take raw sequencing data through to an annotated VCF file are well established. Most studies tend to follow workflows associated with GATK [4–6], SAMtools [7], or a combination of these. In general, workflows start with aligning WES reads to a reference genome and noting reads that vary. The most common of these variants are single nucleotide variants (SNVs), but they also include insertions, deletions, and rearrangements. The location of these variants is used to annotate them to a specific gene. After annotation, the SNVs found can be compared to databases of SNVs found in other studies. This allows for the determination of the frequency of a particular SNV in a given population. In some studies, such as those relating to cancer, rare somatic mutations are of interest. However, in Mendelian studies, the germline mutational landscape will be of more interest than somatic mutations. Before a final VCF file is produced for a given sample, software can be used to predict whether the variant will be functionally damaging to the protein, in order to prioritize candidate genes for further study.
Bioinformatics methods developed beyond the establishment of annotated VCF files are far less established. In cancer research, the most established types of beyond VCF tools are focused on the detection of somatic mutations. However, there are strides being made to develop other computational tools, including pathway analysis, copy number alteration, INDEL identification, driver mutation predictions, and linking candidate genes to clinical data and actionable targets.
Here, we will review recent computational tools in the analysis and interpretation of WES data, with special focus on the applications of these methods in cancer research. We have surveyed the current trends in next-generation sequencing analysis tools and compared their methodology so that researchers can better determine which tools are the best for their WES study and the advancement of precision medicine. In addition, we include a list of publicly available bioinformatics and computational tools as a reference for WES studies (Table 1).
2. Computational Tools in Pre-VCF Analyses
Alignment, removal of duplicates, variant calling, annotation, filtration, and prediction are all parts of the steps leading up to the generation of a filtered and annotated VCF file. Here, we review each one of these steps, as shown in Figure 2, and compare and contrast some of the tools that can be used to perform the Pre-VCF analysis steps.
2.1. Alignment Tools. The first step in any analysis of next-generation sequencing is to align the sequencing reads to a reference genome. The two most common reference genomes for humans currently are hg18 and hg19. Several aligning algorithms have been developed, including, but not limited to, BWA [8], Bowtie 1 [9] and 2 [10], GEM [11], ELAND (Illumina, Inc.), GSNAP [12], MAQ [13], mrFAST [14], Novoalign (http://www.novocraft.com), SOAP 1 [15] and 2 [16], SSAHA [17], Stampy [18], and YOABS [19]. Each method has its own unique features, and many papers have reviewed the differences between them [20–22]; we will not review these tools in depth here. The three most commonly used of these algorithms are BWA, Bowtie (1 and 2), and SOAP (1 and 2).
2.2. Auxiliary Tools. Some auxiliary tools have been developed to filter aligned reads to ensure higher quality data for downstream analyses. PCR amplification can introduce duplicate reads of paired-end reads in sequencing data. These duplicate reads can influence the depth of the mapped reads and downstream analyses. For example, if a variant is detected in duplicate reads, the proportion of reads containing a variant could pass the threshold needed for variant calling, thus calling a falsely positive variant. Therefore, removing duplicate reads is a crucial step in accurately representing the sequencing depth during downstream analyses. Several tools have been developed to detect PCR duplicates, including Picard (http://picard.sourceforge.net/), FastUniq [23], and SAMtools [7]. SAMtools rmdup finds reads that start and end at the same position, finds the read with the highest quality score, and marks the rest of the duplicates for removal. Picard finds identical 5′ positions for both reads in a mate pair and marks them as duplicates. In contrast, FastUniq takes a de novo approach to quickly identify PCR duplicates. FastUniq imports all reads, sorts them according to their location, and then marks duplicates. This allows FastUniq not to require complete genome sequences as prerequisites. Because each of these tools uses a different algorithm, they can be used to remove PCR duplicates individually or in combination.
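The position-based strategy described above for SAMtools rmdup (group reads that start and end at the same position, keep the one with the highest quality, flag the rest) can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the SAMtools implementation: reads are represented as plain dictionaries with hypothetical keys (name, chrom, start, end, qual), whereas real tools operate on SAM/BAM records and also consider strand and mate information.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads to flag as duplicates.

    Reads sharing (chrom, start, end) form one group; the read with the
    highest quality in each group is kept and the others are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["end"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual"])  # first read wins on ties
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Hypothetical reads, for illustration only.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 175, "qual": 38},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 175, "qual": 30},
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 175, "qual": 35},
    {"name": "r4", "chrom": "chr1", "start": 240, "end": 315, "qual": 33},
]
print(sorted(mark_duplicates(reads)))
```

Counting depth only over the reads that are not flagged keeps a variant seen in three PCR copies of one fragment from being treated as three independent observations.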
2.3. Methods for Single Nucleotide Variants (SNVs) Calling. After sequences have been aligned to the reference genome, the next step is to perform variant detection in the WES data. There are four general categories of variant calling strategies: germline variants, somatic variants, copy number variations, and structural variants. Multiple tools that perform one or more of these variant calling techniques were recently compared to each other [24]. Some common SNV calling programs are GATK [4–6], SAMtools [7], and VCMM [25]. The actual SNV calling mechanisms of GATK and SAMtools are very similar. However, the context before and after SNV calling represents the differences between these tools. GATK assumes each sequencing error is independent, while SAMtools assumes that a secondary error carries more weight. After SNV calling, GATK learns from data, while SAMtools relies on user-specified options. Variant Caller with Multinomial probabilistic Model (VCMM) is another tool developed to detect SNVs and INDELs from WES and Whole Genome Sequencing (WGS) studies using a multinomial probabilistic model with quality score and a strand bias filter [25]. VCMM suppressed the false-positive and false-negative variant calls when compared to GATK and SAMtools. However, the number of variant calls was similar to previous studies. The comparison done by the authors of VCMM demonstrated that, while all three methods call a large number of common SNVs, each tool also identifies SNVs not found by the other methods [25]. The ability of each method to call SNVs not found by the others should be taken into account when choosing one or more SNV calling tools.
[Figure 2: workflow diagram. Whole exome sequencing data (Fastq) → Alignment → SNV & SV calling → VCF → VCF annotation → Functional prediction → Somatic mutations → Driver prediction → Copy number alterations → Pathway analysis → Therapeutic prediction → Biological insights & precision medicine. Example tools shown per step:
Alignment tools: BWA, Bowtie, ELAND, GEM, GSNAP, MAQ, mrFAST, Novoalign, SOAP, SSAHA, Stampy, YOABS.
Auxiliary tools: FastUniq, Picard, SAMtools.
SNV & SV calling tools: GATK, FreeBayes, indelMINER, Pindel, Platypus, SAMtools, Splitread, Sprites, VCMM.
VCF annotation tools: ANNOVAR, MuTect, SnpEff, SnpSift, VAT.
Database filtration: 1000 Genomes Project, dbSNP, LOVD, COSMIC.
Tools for functional variants prediction: CADD, FATHMM, LRT, MetaLR, MetaSVM, PolyPhen-2, SIFT, VEST.
Somatic mutation detection tools: MuTect, SomaticSniper, SomVarIUS, VarSim.
Driver prediction tools: CHASM, Dendrix, MutSigCV.
Copy number alteration tools: ADTEx, CNV-seq, CNVseeqer, CONTRA, Control-FREEC, EXCAVATOR, ExomeAI, ExomeCNV, SegSeq, VarScan2.
Pathway resources and pathway analysis tools: KEGG, STRING, BEReX, DAPPLE, SNPsea, DAVID.
Therapeutic resources: My Cancer Genome, ClinVar, PharmGKB, DrugBank, DSigDB.
WES data analysis pipelines: SeqMule, Fastq2vcf, IMPACT, Genomes on the Cloud.]
Figure 2: Whole Exome Sequencing data analysis steps. Novel computational methods and tools have been developed to analyze the full spectrum of WES data, translating raw fastq files to biological insights and precision medicine.
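The observation in Section 2.3 that each caller reports SNVs the others miss can be checked on one's own data by comparing call sets directly. The sketch below assumes each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples; the function name compare_call_sets and the example calls are hypothetical and are not results from [25].

```python
def compare_call_sets(calls_by_tool):
    """Return (shared, unique) for a dict of tool name -> set of variant keys.

    shared: variants reported by every tool.
    unique: per tool, the variants reported by no other tool.
    """
    shared = set.intersection(*calls_by_tool.values())
    unique = {}
    for tool, calls in calls_by_tool.items():
        others = set().union(
            *(c for name, c in calls_by_tool.items() if name != tool)
        )
        unique[tool] = calls - others
    return shared, unique


# Hypothetical call sets, for illustration only.
calls = {
    "GATK": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "SAMtools": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 75, "T", "C")},
    "VCMM": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr4", 10, "C", "G")},
}
shared, unique = compare_call_sets(calls)
print("called by all tools:", sorted(shared))
for tool in calls:
    print(tool, "only:", sorted(unique[tool]))
```

Whether to keep the intersection (fewer false positives) or the union (fewer false negatives) depends on the study; the sketch only makes the trade-off visible.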
2.4. Methods for Structural Variants (SVs) Identification. Structural Variants (SVs), such as insertions and deletions (INDELs), in high-throughput sequencing data are more challenging to identify than single nucleotide variants because they could involve an undefined number of nucleotides. The majority of WES studies follow SAMtools [7] or GATK [4–6] workflows, which will identify INDELs in the data. However, other software has been developed to increase the sensitivity of INDEL discovery while simultaneously decreasing the false discovery rate.
Platypus [26] was developed to find SNVs, INDELs, and complex polymorphisms using local de novo assembly. When compared to SAMtools and GATK, Platypus had the lowest Fosmid false discovery rate for both SNVs and INDELs in whole genome sequencing of 15 samples. It also had the shortest runtime of these tools. However, GATK and SAMtools had lower Fosmid false discovery rates than Platypus when finding SNVs and INDELs in WES data [26]. Therefore, Platypus seems to be appropriate for whole genome sequencing, but caution should be used when utilizing this tool with WES data.
FreeBayes uses a unique approach to INDEL detection compared to other tools. The method utilizes haplotype-based variant detection under a Bayesian statistics framework [27]. This method has been used in several studies in combination with other approaches to identify unique INDELs [28, 29].
Pindel was one of the first programs developed to address the issue of unidentified large INDELs due to the short length of WGS reads [30]. In brief, after alignment of the reads to the reference genome, Pindel identifies reads where one end was mapped and the other was not [30]. Then, Pindel searches the reference genome for the unmapped portion of this read over a user-defined area of the genome [30]. This split-read algorithm successfully identified large INDELs. Other computational tools developed after Pindel still utilize this algorithm as the foundation in their methods for detecting INDELs.
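The split-read idea can be made concrete with a toy example. The sketch below is not Pindel's pattern growth algorithm; it is a deliberately simplified single-read illustration in which the left part of a read is anchored on the reference and the unmatched remainder is searched for within a bounded window downstream. The function name find_deletion, the parameters max_span and min_part, and the sequences are all hypothetical.

```python
def find_deletion(reference, read, anchor, max_span=50, min_part=5):
    """Toy split-read search for a single deletion.

    reference: reference sequence
    read:      read whose left part aligns at `anchor` (0-based) in reference
    max_span:  largest deletion to look for (stands in for the user-defined area)
    min_part:  minimum length of each read part for the split to be trusted
    Returns (deletion_start, deletion_length), or None if no split is found.
    """
    # Extend the anchored prefix for as long as read and reference agree.
    k = 0
    while (k < len(read) and anchor + k < len(reference)
           and read[k] == reference[anchor + k]):
        k += 1

    suffix = read[k:]
    if k < min_part or len(suffix) < min_part:
        return None  # read aligns end to end, or one part is too short

    # Search for the unmatched remainder within the window downstream.
    window_start = anchor + k
    window = reference[window_start:window_start + max_span + len(suffix)]
    hit = window.find(suffix)
    if hit <= 0:
        return None
    return window_start, hit


# Hypothetical sequences: the read lacks the six bases "GGGCCC".
left, deleted, right = "ACGTTGCA", "GGGCCC", "TATACGAT"
reference = "TT" + left + deleted + right + "AA"
read = left + right
print(find_deletion(reference, read, anchor=2))
```

Limiting the search to a window is what keeps this approach tractable: the remainder of the read is short, so an unrestricted search of the genome would return many chance matches.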
Splitread [31] was developed to specifically identify structural variants and INDELs in WES data from 1 bp to 1 Mbp, building on the split-read approach of Pindel [30]. The algorithms used by SAMtools and GATK limit the size of structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits it being used in conjunction with other INDEL detecting software in WES analysis.
Table 1
| Computational tools | Description | Website | References |
| Alignment tools | | | |
| Burrows-Wheeler Aligner (BWA) | Performs short-read alignment using the BWT approach against a reference genome, allowing for gaps and mismatches | http://bio-bwa.sourceforge.net | [8] |
| Bowtie (1 & 2) | Performs short read alignment using the Burrows-Wheeler index in order to be memory efficient while still maintaining an alignment speed of over 25 million 35 bp reads per hour | http://bowtie-bio.sourceforge.net/index.shtml | [9, 10] |
| ELAND | Short read aligner that achieves speed by splitting reads into equal lengths and applying seed templates to guarantee hits with only 2 mismatches | http://www.illumina.com | Illumina, Inc. |
| GEM | Short read aligner using string matching instead of BWT to deliver precision and speed | http://algorithms.cnag.cat/wiki/The_GEM_library | [11] |
| GSNAP | Performs short and long read alignment, detects long and short distance splicing and SNPs, and is capable of detecting bisulfite-treated DNA for methylation studies | http://research-pub.gene.com/gmap | [12] |
| MAQ | Short read aligner compatible with Illumina-Solexa and ABI SOLiD data; performs ungapped alignment allowing 2-3 mismatches for single-end reads and one mismatch for paired-end reads | http://maq.sourceforge.net | [13] |
| mrFAST | Performs short read alignment allowing for INDELs up to 8 bp for Illumina generated data. Paired-end mapping using a one end anchored algorithm allows for detection of novel insertions | http://mrfast.sourceforge.net | [14] |
| Novoalign | Alignment done on paired-end or single-end sequences; also capable of doing methylation studies. Allows for a mismatch up to 50% of a read length and has built-in adapter and base quality trimming | http://www.novocraft.com/products/novoalign | http://www.novocraft.com |
| SOAP (1 & 2) | SOAP2 improved speed by an order of magnitude over SOAP1 and can align a wide range of read lengths at the speed of 2 minutes for one million single-end reads using a two-way BWT algorithm | http://soap.genomics.org.cn | [15, 16] |
| SSAHA | Uses a hashing algorithm to find exact or close to exact matching in DNA and protein databases, analogous to doing a BLAST search for each read | https://www.vectorbase.org/glossary/ssaha-sequence-search-and-alignment-hashing-algorithm | [17] |
| Stampy | Alignment done using a hashing algorithm and statistical model to align Illumina reads for genome, RNA, and ChIP sequencing, allowing for a large number of variations including insertions and deletions | http://www.well.ox.ac.uk/project-stampy | [18] |
| YOABS | Uses an O(n) algorithm that uses both hash and trie-based methods that are effective in aligning sequences over 200 bp with 3 times less memory and ten times faster than SSAHA | Available by request for noncommercial use | [19] |
| HTSeq | Python based package with many functions to facilitate several aspects of sequencing studies | http://www-huber.embl.de/HTSeq/doc/overview.html | |
| Auxiliary tools | | | |
| FastUniq | Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data | https://sourceforge.net/projects/fastuniq | [23] |
| Picard | Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF | http://picard.sourceforge.net | |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| SNV and SV calling | | | |
| GATK | Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms | https://www.broadinstitute.org/gatk | [4–6] |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| VCMM | Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data | http://emu.src.riken.jp/VCMM | [25] |
| FreeBayes | Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods | https://github.com/ekg/freebayes | [27] |
| indelMINER | Split-read algorithm to identify breakpoints in INDELs from paired-end sequencing data | https://github.com/aakrosh/indelMINER | [32] |
| Pindel | Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution | http://gmt.genome.wustl.edu/packages/pindel | [30] |
| Platypus | Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity | http://www.well.ox.ac.uk/platypus | [26] |
| Splitread | Detection of INDELs less than 50 bp long from WES or WGS data using a split-read algorithm | http://splitread.sourceforge.net | [31] |
| Sprites | Detection of INDELs is done using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage | https://github.com/zhangzhen/sprites | [33] |
| VCF annotation | | | |
| ANNOVAR | Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases | http://annovar.openbioinformatics.org | [34] |
| MuTect | Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing | http://www.broadinstitute.org/cancer/cga/mutect | [35] |
| SnpEff | Uses 38,000 genomes to predict and annotate the effects of variants on genes | http://snpeff.sourceforge.net | [36] |
| SnpSift | Tools to manipulate VCF files including filtering, annotation, case controls, transition and transversion rates, and more | http://snpeff.sourceforge.net/SnpSift.html | [37] |
| VAT | Annotation of variants by functionality in a cloud computing environment | http://vat.gersteinlab.org | [38] |
| Database filtration | | | |
| 1000 Genomes Project | Genotype information from a population of 1000 healthy individuals | http://www.1000genomes.org | [41] |
| dbSNP | Database of genomic variants from 53 organisms | https://www.ncbi.nlm.nih.gov/projects/SNP | [39] |
| LOVD | Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data | http://www.lovd.nl/3.0/home | [40] |
| COSMIC | Database containing somatic mutations from human cancers separated into expert curated data and genome-wide screens published in scientific literature | http://cancer.sanger.ac.uk/cosmic | [42] |
| NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations | http://evs.gs.washington.edu/EVS | |
| Exome Aggregation Consortium (ExAC) | Database of 60,706 unrelated individuals from disease and population exome sequencing studies | http://exac.broadinstitute.org | [3] |
| SeattleSeq Annotation | Part of the NHLBI sequencing project, this database contains novel and known SNVs and INDELs including accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions | http://snp.gs.washington.edu/SeattleSeqAnnotation137 | |
| Functional predictors | | | |
| CADD | Machine learning algorithm to score all possible 8.6 billion substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants | http://cadd.gs.washington.edu/info | [49] |
| FATHMM | Uses Hidden Markov Models to predict the functional consequences of SNVs in coding and noncoding variants through a web server | http://fathmm.biocompute.org.uk | [46] |
| LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if they are predicted to be benign, deleterious, or unknown | http://genome.cshlp.org/content/19/9/1553.long | [45] |
| PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics | http://genetics.bwh.harvard.edu/pph2 | [44] |
| SIFT | By using PSI-BLAST, a prediction can be made on the effect of a nonsynonymous mutation within a protein | http://sift.jcvi.org | [43] |
| VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein | http://karchinlab.org/apps/appVest.html | [48] |
| MetaSVM & MetaLR | Integration of a Support Vector Machine and Logistic Regression to integrate nine deleterious prediction scores of missense mutations | https://sites.google.com/site/jpopgen/dbNSFP | [47] |
| Significant somatic mutations | | | |
| SomaticSniper | Using two bam files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper | [50] |
| MuTect | Uses statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35] |
| VarSim | By leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim | [51] |
| SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52] |
| Copy number alteration | | | |
| Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec | [59] |
| CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq | [53] |
| SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54] |
| VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, which are then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61] |
| ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64] |
| CNVseeqer | Exon coverage between matched sequences is calculated using log2 ratios followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60] |
| EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool | [57] |
| ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58] |
| ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies; implemented in R | http://adtex.sourceforge.net | [55] |
| CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv | [56] |
| Driver prediction tools | | | |
| CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65] |
| Dendrix | De novo drivers are discovered from cancer only mutational data including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix | [66] |
| MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67] |
| Pathway analysis tools and resources | | | |
| KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg | [68] |
| DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more | https://david.ncifcrf.gov | [69] |
| STRING | Network visualization of protein-protein interactions of over 2031 organisms | http://string-db.org | [70] |
| BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex | [71] |
| DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72] |
| SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea | [73] |
| Tools and resources for linking variants to therapeutics | | | |
| cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies including providing patient and clinical data for samples | http://www.cbioportal.org | [78] |
| My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org | http://www.mycancergenome.org |
| ClinVar | Database of relationships between phenotypes and human variations, showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar | [74] |
| DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77] |
| PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org | [75] |
| DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca | [76] |
| WES data analysis pipelines | | | |
| Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file, and that has good capability for novice and expert users | http://fastq2vcf.sourceforge.net | [80] |
| SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest | [79] |
| IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81] |
| Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud | |
The recently developed indelMINER is a compilation of tools that takes the strengths of split-read and de novo assembly to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs with 6,491, followed by Pindel with 7,239 and indelMINER with 7,365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER did have the lowest number of false-negatives with 398, compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
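The counts quoted above are enough to derive precision, recall, and an F-score for each tool, which puts the trade-off between false positives and false negatives on one scale. The sketch below is a worked calculation, not a result reported in [32]: it assumes the false-positive percentage is the share of a tool's calls that are false, an assumption that reproduces the quoted percentages from the quoted counts. The function name indel_metrics is hypothetical.

```python
def indel_metrics(truth_total, called, false_negatives):
    """Derive confusion counts and summary scores for one INDEL caller.

    truth_total:     number of simulated INDELs (7500 in the comparison)
    called:          number of INDELs the tool reported
    false_negatives: simulated INDELs the tool missed
    """
    true_positives = truth_total - false_negatives
    false_positives = called - true_positives
    precision = true_positives / called
    recall = true_positives / truth_total
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "tp": true_positives,
        "fp": false_positives,
        "fp_percent": 100 * false_positives / called,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Counts as quoted in the text: (INDELs called, false negatives).
tools = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}
results = {name: indel_metrics(7500, c, fn) for name, (c, fn) in tools.items()}
for name, m in results.items():
    print(f"{name}: TP={m['tp']} FP={m['fp']} FP%={m['fp_percent']:.2f} "
          f"recall={m['recall']:.3f} F={m['f_score']:.3f}")
```

Under this reading, indelMINER's higher recall outweighs its slightly higher false-positive share relative to SAMtools, which is the trade-off the paragraph describes.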
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipping reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]
VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene are affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information, including disease association, genotype origin, and somatic and germline variant information [39].
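A MAF-based filter of the kind described above can be sketched in a few lines. The record layout (a dictionary with a "maf" key), the variant identifiers, and the choice to retain variants that are absent from the population database are all assumptions made for illustration; only the 1% threshold comes from the text.

```python
def filter_rare(variants, max_maf=0.01):
    """Keep variants with a population MAF below the threshold.

    Variants with no recorded MAF (None) are retained here on the assumption
    that absence from the population database suggests a rare or novel variant.
    """
    return [v for v in variants if v["maf"] is None or v["maf"] < max_maf]


# Hypothetical annotated variants; identifiers and frequencies are invented.
variants = [
    {"id": "var_a", "maf": 0.25},   # common polymorphism
    {"id": "var_b", "maf": 0.004},  # rare variant
    {"id": "var_c", "maf": None},   # not present in the database
]

rare = filter_rare(variants)
print([v["id"] for v in rare])
```

A study that needs common SNVs would simply invert or drop the threshold, which is why the filtration step depends on the purpose of the study.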
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1,000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies, drawn from almost 20,000 publications, allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing whether a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors, including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict the functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
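One simple way to make disagreement between predictors visible is to tally, for each variant, how many tools call it deleterious. The sketch below uses invented calls for three hypothetical variants; the tool names come from the text, but the data layout and the results are illustrative only and do not reproduce any published comparison.

```python
# Hypothetical calls: True means the tool labels the variant deleterious.
calls = {
    "variant_1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "variant_2": {"SIFT": True, "PolyPhen-2": False, "LRT": True},
    "variant_3": {"SIFT": False, "PolyPhen-2": False, "LRT": True},
}


def deleterious_votes(by_tool):
    """Count the predictors that flag a variant as deleterious."""
    return sum(1 for flagged in by_tool.values() if flagged)


def summarize(all_calls):
    """Report the vote count per variant and whether the tools are unanimous."""
    summary = {}
    for variant, by_tool in all_calls.items():
        votes = deleterious_votes(by_tool)
        summary[variant] = {"votes": votes, "unanimous": votes == len(by_tool)}
    return summary


for variant, result in summarize(calls).items():
    print(variant, result)
```

Keeping the vote count rather than a single yes or no lets the analyst choose a stricter or looser consensus threshold depending on the focus of the study.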
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
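Phred scaling maps a probability p to -10 log10(p), so that small probabilities become large integer scores. The sketch below assumes that the quantity being scaled is the probability that the tumor and normal genotypes are actually identical (one minus the probability that they differ); that reading, the function names, and the cap of 255 are our assumptions rather than details confirmed by the text.

```python
import math


def phred_scaled(probability, cap=255):
    """Convert a probability into a Phred-scaled score, -10 * log10(p).

    The cap for a probability of zero is an assumed convention.
    """
    if probability <= 0:
        return cap
    return min(cap, round(-10 * math.log10(probability)))


def somatic_score(p_genotypes_differ):
    """Assumed reading: scale the probability that the genotypes are the same."""
    return phred_scaled(1.0 - p_genotypes_differ)


for p in (0.9, 0.99, 0.999):
    print(p, somatic_score(p))
```

On this scale each additional 10 points corresponds to a tenfold drop in the probability that the call is not somatic.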
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, the somatic mutations are not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall rates when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and then normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (more than 25% of a chromosome arm) or focal (less than 25%) events in the WES data [63].
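The depth comparison described above can be illustrated with a minimal calculation. The sketch below assumes that normalization means dividing each region's depth by the total amount of sequence data for its sample before taking the log2 ratio; the function name, parameters, and example numbers are hypothetical and this is not VarScan2's actual implementation.

```python
import math


def log2_depth_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2 of the tumor/normal depth ratio after library-size normalization.

    tumor_total and normal_total stand for the amount of input data per sample
    (for example, total mapped bases); this normalization is an assumption.
    """
    if normal_depth <= 0 or tumor_depth <= 0:
        raise ValueError("depths must be positive to take a log ratio")
    tumor_fraction = tumor_depth / tumor_total
    normal_fraction = normal_depth / normal_total
    return math.log2(tumor_fraction / normal_fraction)


# Equal library sizes, tumor depth doubled: log2 ratio close to +1 (a gain).
print(log2_depth_ratio(200, 100, 1_000_000, 1_000_000))
# Tumor library twice as large, depth doubled: log2 ratio close to 0 (no change).
print(log2_depth_ratio(200, 100, 2_000_000, 1_000_000))
```

Segmentation would then be applied to the per-region log2 ratios to merge neighboring regions with similar values.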
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
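The per-site quantity described above is straightforward to compute. The sketch below assumes that the B-allele frequency is the fraction of reads supporting the B-allele at a heterozygous site; the function name and the read counts are illustrative, and the segmentation step is omitted.

```python
def baf_deviation(a_allele_reads, b_allele_reads):
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_allele_reads + b_allele_reads
    if total == 0:
        raise ValueError("site has no coverage")
    b_allele_frequency = b_allele_reads / total
    return abs(b_allele_frequency - 0.5)


# A balanced heterozygous site versus one skewed toward the B-allele.
print(baf_deviation(50, 50))
print(baf_deviation(20, 80))
```

Because the absolute value is taken, imbalance toward either allele produces the same signal, which is what allows segments of consistent deviation to be detected.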
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
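The effect of gene size on chance mutation can be made concrete with a toy model. The sketch below assumes a single uniform per-base mutation probability applied independently to every coding base, which is precisely the simplification that MutSigCV is described as improving on with gene-specific covariates; the rate and gene lengths are invented for illustration.

```python
def prob_mutated_by_chance(coding_length_bp, per_base_rate):
    """P(at least one mutation) under a uniform, independent per-base rate."""
    return 1.0 - (1.0 - per_base_rate) ** coding_length_bp


# Hypothetical values: one background rate, a short gene and a very long gene.
rate = 1e-6
short_gene_bp = 1_000
long_gene_bp = 100_000

print(prob_mutated_by_chance(short_gene_bp, rate))
print(prob_mutated_by_chance(long_gene_bp, rate))
```

Even with an identical background rate, the long gene is far more likely to carry a mutation in any given patient, so recurrence alone is weak evidence that a very large gene is a driver.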
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to identify sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients, and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
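The two properties can be combined into a single score for a candidate gene set. The sketch below rewards the number of patients covered and subtracts the number of redundant hits (patients mutated in more than one gene of the set); to our understanding this mirrors the kind of weight Dendrix optimizes, but the exact form, the function names, and the toy data are assumptions, and the greedy and MCMC searches are not shown.

```python
def covered_patients(gene_set, mutations):
    """Patients carrying a mutation in at least one gene of the set."""
    return {patient for patient, genes in mutations.items() if genes & gene_set}


def coverage_exclusivity_weight(gene_set, mutations):
    """Coverage minus overlap: high when a set is both covering and exclusive."""
    covered = len(covered_patients(gene_set, mutations))
    total_hits = sum(len(genes & gene_set) for genes in mutations.values())
    overlap = total_hits - covered  # extra hits beyond one per covered patient
    return covered - overlap


# Hypothetical patient -> mutated genes table.
mutations = {
    "p1": {"A"},
    "p2": {"B"},
    "p3": {"A", "B"},
    "p4": {"C"},
}

print(coverage_exclusivity_weight({"A", "B"}, mutations))
print(coverage_exclusivity_weight({"A", "C"}, mutations))
```

In this toy table both sets cover three patients, but the set containing A and C scores higher because no patient carries both mutations.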
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
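Functional enrichment analysis is commonly framed as an over-representation test: given a gene list, is a pathway hit more often than expected by chance? The sketch below uses a plain hypergeometric tail probability as a generic illustration; it is not the specific statistic implemented by DAVID, STRING, or BEReX, and the function name and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, pathway_size, sample_size, overlap):
    """P(X >= overlap) when drawing sample_size genes from the population.

    population:   number of background genes
    pathway_size: background genes annotated to the pathway
    sample_size:  genes in the user-defined list
    overlap:      listed genes that fall in the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 40 mutated genes, 8 of them in a 200-gene pathway,
# against a background of 20000 genes.
print(enrichment_p_value(20000, 200, 40, 8))
```

In practice the p-values for many pathways would also need correction for multiple testing, which is omitted here.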
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines, SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options as compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
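The comparison above can be captured as a small lookup table and queried with set operations. The sketch below encodes only the aligners and variant callers named in this section; the dictionary layout is our own, and for IMPACT only the tools explicitly mentioned in the text are listed, so the table should not be read as a complete description of any pipeline.

```python
# Tools per pipeline as named in this section (not an exhaustive inventory).
pipelines = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie 1", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {"GATK UnifiedGenotyper", "GATK HaplotypeCaller",
                    "SAMtools", "SNVer"},
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},  # only the caller named in the text
    },
}


def shared_tools(kind):
    """Tools of the given kind that every pipeline in the table uses."""
    return set.intersection(*(entry[kind] for entry in pipelines.values()))


print("Shared aligners:", shared_tools("aligners"))
print("Shared callers:", shared_tools("callers"))
```

A table of this kind makes it easy to check which pipeline supports a required tool before committing to it for a study.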
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider sets of tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods were developed to generate VCF files from raw sequencing data and to perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "DbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "Somaticsniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010
[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
[Figure 1 (bar chart): y-axis, number of publications in PubMed; x-axis, year; series, Whole Exome Sequencing (WES) studies and WES computational tools.]
Figure 1: Trends in Whole Exome Sequencing studies and tools by querying PubMed (2011–2016).
analysis can also include methods that link variants to clinical data as well as potential therapeutics (Figure 2).
Computational tools that take raw sequencing data through to an annotated VCF file are well established. Most studies follow workflows associated with GATK [4–6], SAMtools [7], or a combination of these. In general, workflows start by aligning WES reads to a reference genome and noting the positions at which reads differ from it. The most common of these variants are single nucleotide variants (SNVs), but they also include insertions, deletions, and rearrangements. The location of each variant is used to annotate it to a specific gene. After annotation, the SNVs found can be compared to databases of SNVs found in other studies, which allows the frequency of a particular SNV in a given population to be determined. In some studies, such as those relating to cancer, rare somatic mutations are of interest. In Mendelian studies, however, the germline mutational landscape is of more interest than somatic mutations. Before a final VCF file is produced for a given sample, software can be used to predict whether a variant will be functionally damaging to the protein, in order to prioritize candidate genes for further study.
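The filtering logic described above (compare each annotated variant against a population database, then keep rare variants that are predicted to be damaging) can be sketched as follows. This is a minimal illustration only: the `Variant` record, its field names, the gene labels, and the 1% frequency cutoff are hypothetical choices made for this example and do not correspond to the output of any specific tool reviewed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    """Hypothetical annotated variant record (illustrative fields only)."""
    gene: str
    chrom: str
    pos: int
    ref: str
    alt: str
    population_af: Optional[float]  # None when the variant is absent from the database
    predicted_damaging: bool        # output of some functional predictor


def prioritize(variants: List[Variant], max_af: float = 0.01) -> List[Variant]:
    """Keep variants that are rare (or unseen) in the population and predicted damaging."""
    kept = []
    for v in variants:
        is_rare = v.population_af is None or v.population_af <= max_af
        if is_rare and v.predicted_damaging:
            kept.append(v)
    return kept


variants = [
    Variant("GENE_A", "chr1", 1000, "A", "G", 0.25, True),    # common: dropped
    Variant("GENE_B", "chr2", 2000, "C", "T", 0.001, True),   # rare and damaging: kept
    Variant("GENE_C", "chr3", 3000, "G", "A", None, False),   # novel but benign: dropped
    Variant("GENE_D", "chr4", 4000, "T", "C", None, True),    # novel and damaging: kept
]
candidates = prioritize(variants)
print([v.gene for v in candidates])  # ['GENE_B', 'GENE_D']
```

In a cancer study the same filter would typically be applied after somatic calling, whereas a Mendelian study would apply it to germline calls; the cutoff is a study-specific decision.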
Bioinformatics methods that go beyond the generation of annotated VCF files are far less established. In cancer research, the most established types of beyond-VCF tools focus on the detection of somatic mutations. However, strides are being made to develop other computational tools, including pathway analysis, copy number alteration, INDEL identification, driver mutation prediction, and linking candidate genes to clinical data and actionable targets.
Here, we review recent computational tools for the analysis and interpretation of WES data, with special focus on the applications of these methods in cancer research. We have surveyed the current trends in next-generation sequencing analysis tools and compared their methodologies so that researchers can better determine which tools are best for their WES study and for the advancement of precision medicine. In addition, we include a list of publicly available bioinformatics and computational tools as a reference for WES studies (Table 1).
2. Computational Tools in Pre-VCF Analyses
Alignment, removal of duplicates, variant calling, annotation, filtration, and prediction are all steps leading up to the generation of a filtered and annotated VCF file. Here, we review each of these steps, as shown in Figure 2, and compare and contrast some of the tools that can be used to perform the pre-VCF analysis steps.
2.1. Alignment Tools. The first step in any analysis of next-generation sequencing is to align the sequencing reads to a reference genome. The two most common reference genomes for humans currently are hg18 and hg19. Several alignment algorithms have been developed, including but not limited to BWA [8], Bowtie 1 [9] and 2 [10], GEM [11], ELAND (Illumina, Inc.), GSNAP [12], MAQ [13], mrFAST [14], Novoalign (http://www.novocraft.com), SOAP 1 [15] and 2 [16], SSAHA [17], Stampy [18], and YOABS [19]. Each method has its own unique features, and many papers have reviewed the differences between them [20–22], so we will not review these tools in depth here. The three most commonly used of these algorithms are BWA, Bowtie (1 and 2), and SOAP (1 and 2).
2.2. Auxiliary Tools. Some auxiliary tools have been developed to filter aligned reads to ensure higher quality data for downstream analyses. PCR amplification can introduce duplicate reads into paired-end sequencing data. These duplicate reads can influence the depth of the mapped reads and downstream analyses. For example, if a variant is detected in duplicate reads, the proportion of reads containing the variant could pass the threshold needed for variant calling, thus calling a false-positive variant. Therefore, removing duplicate reads is a crucial step in accurately representing the sequencing depth during downstream analyses. Several tools have been developed to detect PCR duplicates, including Picard (http://picard.sourceforge.net/), FastUniq [23], and SAMtools [7]. SAMtools rmdup finds reads that start and end at the same position, finds the read with the highest quality score, and marks the rest of the duplicates for removal. Picard finds identical 5′ positions for both reads in a mate pair and marks them as duplicates. In contrast, FastUniq takes a de novo approach to quickly identify PCR duplicates: it imports all reads, sorts them according to their location, and then marks duplicates. As a result, FastUniq does not require complete genome sequences as prerequisites. Because each of these tools uses a different algorithm, they can be used to remove PCR duplicates individually or in combination.
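The position-based strategy described above (group reads that start and end at the same position, keep the one with the highest quality, and flag the rest) can be sketched in a few lines. This is a simplified illustration and not the implementation used by SAMtools, Picard, or FastUniq: the `Read` record, its fields, and the use of a single mean quality value per read are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    """Hypothetical aligned read (illustrative fields only)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mean_quality: float


def mark_duplicates(reads: Iterable[Read]) -> Set[str]:
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start, end, and strand form one group;
    the read with the highest mean quality is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.end, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.mean_quality)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("r1", "chr1", 100, 200, "+", 30.0),
    Read("r2", "chr1", 100, 200, "+", 35.0),  # best of its group: kept
    Read("r3", "chr1", 100, 200, "+", 20.0),
    Read("r4", "chr1", 150, 250, "+", 30.0),  # unique position: kept
]
dups = mark_duplicates(reads)
print(sorted(dups))  # ['r1', 'r3']
```

Removing `r1` and `r3` before variant calling prevents a single amplified fragment from being counted three times toward the supporting read depth.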
2.3. Methods for Single Nucleotide Variants (SNVs) Calling. After sequences have been aligned to the reference genome, the next step is to perform variant detection in the WES data. There are four general categories of variant calling strategies: germline variants, somatic variants, copy number variations, and structural variants. Multiple tools that perform one or more of these variant calling techniques were recently compared to each other [24]. Some common SNV calling
[Figure 2 (workflow diagram). Panels: whole exome sequencing data (FASTQ); alignment; SNV & SV calling; VCF; VCF annotation; functional prediction; somatic mutations; driver prediction; copy number alterations; pathway analysis; therapeutic prediction; biological insights & precision medicine. Example tools shown for each panel are also listed in Table 1.]
Figure 2: Whole Exome Sequencing data analysis steps. Novel computational methods and tools have been developed to analyze the full spectrum of WES data, translating raw fastq files to biological insights and precision medicine.
programs are GATK [4–6], SAMtools [7], and VCMM [25]. The actual SNV calling mechanisms of GATK and SAMtools are very similar; the differences between these tools lie in the context before and after SNV calling. GATK assumes each sequencing error is independent, while SAMtools assumes a secondary error carries more weight. After SNV calling, GATK learns from data, while SAMtools relies on options set by the user. Variant Caller with Multinomial probabilistic Model (VCMM) is another tool developed to detect SNVs and INDELs from WES and Whole Genome Sequencing (WGS) studies, using a multinomial probabilistic model with quality score and a strand bias filter [25]. VCMM suppressed false-positive and false-negative variant calls when compared to GATK and SAMtools. However, the number of variant calls was similar to previous studies. The comparison done by the authors of VCMM demonstrated that, while all three methods call a large number of common SNVs, each tool also identifies SNVs not found by the other methods [25]. The ability of each method to call SNVs not found by the others should be taken into account when choosing an SNV calling tool or tools.
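To make the independence assumption mentioned above concrete, the sketch below scores the three diploid genotypes at one site by treating every read as an independent observation, so that per-read log-probabilities are simply added. It is a textbook-style simplification and not the model implemented by GATK, SAMtools, or VCMM: only two alleles are considered, every error is assumed to convert one allele into the other, and the function and parameter names are hypothetical.

```python
import math
from typing import Dict, Sequence


def genotype_log_likelihoods(
    bases: Sequence[str],
    error_probs: Sequence[float],
    alt: str = "G",
) -> Dict[int, float]:
    """Log-likelihood of carrying 0, 1, or 2 copies of the alt allele.

    Assumes each read is independent and each base is either ref or alt.
    Error probabilities must lie strictly between 0 and 1.
    """
    log_liks = {}
    for alt_copies in (0, 1, 2):
        alt_fraction = alt_copies / 2
        total = 0.0
        for base, err in zip(bases, error_probs):
            # Chance of reading the alt base given the genotype and the error rate
            p_alt = alt_fraction * (1 - err) + (1 - alt_fraction) * err
            total += math.log(p_alt if base == alt else 1 - p_alt)
        log_liks[alt_copies] = total
    return log_liks


# Six reference reads and five alternate reads, each with a 1% error rate
bases = ["A"] * 6 + ["G"] * 5
errors = [0.01] * len(bases)
log_liks = genotype_log_likelihoods(bases, errors)
best = max(log_liks, key=log_liks.get)
print(best)  # 1, i.e. heterozygous
```

A model in which a second error at the same site carries more weight would replace the plain sum with a dependence-aware term, which is one way callers can disagree on the same pileup.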
2.4. Methods for Structural Variants (SVs) Identification. Structural variants (SVs), such as insertions and deletions (INDELs), in high-throughput sequencing data are more challenging to identify than single nucleotide variants because they can involve an undefined number of nucleotides. The majority of WES studies follow SAMtools [7] or GATK [4–6] workflows, which will identify INDELs in the data. However, other software has been developed to increase the sensitivity of INDEL discovery while simultaneously decreasing the false discovery rate.
Platypus [26] was developed to find SNVs, INDELs, and complex polymorphisms using local de novo assembly. When compared to SAMtools and GATK, Platypus had the lowest Fosmid false discovery rate for both SNVs and INDELs in whole genome sequencing of 15 samples. It also had the shortest runtime of these tools. However, GATK and SAMtools had lower Fosmid false discovery rates than Platypus when finding SNVs and INDELs in WES data [26]. Therefore, Platypus seems to be appropriate for whole genome sequencing, but caution should be used when utilizing this tool with WES data.
FreeBayes takes a different approach to INDEL detection from the other tools. The method utilizes haplotype-based variant detection under a Bayesian statistical framework [27]. It has been used in several studies in combination with other approaches to identify unique INDELs [28, 29].
Pindel was one of the first programs developed to address the issue of unidentified large INDELs due to the short length of WGS reads [30]. In brief, after alignment of the reads to the reference genome, Pindel identifies reads where one end was mapped and the other was not [30]. Pindel then searches the reference genome for the unmapped portion of this read over a user-defined area of the genome [30]. This split-read algorithm successfully identified large INDELs. Other computational tools developed after Pindel still utilize this algorithm as the foundation of their methods for detecting INDELs.
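The split-read idea described above can be illustrated with a toy search: an unmapped read is split in two, the left part is located inside the search window, and the right part is located further downstream, with the gap between them interpreted as a deletion. This is only a sketch of the general technique under strong simplifications (exact string matching, a single deletion, a unique reference); it is not Pindel's pattern growth implementation, and `find_deletion` and its parameters are hypothetical names.

```python
from typing import Optional, Tuple


def find_deletion(
    reference: str,
    read: str,
    window_start: int,
    window_end: int,
    min_part: int = 5,
) -> Optional[Tuple[int, int]]:
    """Try to explain an unmapped read as spanning a single deletion.

    Returns (deletion_start, deletion_length) in 0-based reference
    coordinates, or None if no split of the read fits inside the window.
    """
    window = reference[window_start:window_end]
    for split in range(min_part, len(read) - min_part + 1):
        left, right = read[:split], read[split:]
        left_pos = window.find(left)
        if left_pos == -1:
            continue
        left_end = window_start + left_pos + len(left)
        # The right part must start at least one base after the left part ends
        right_pos = reference.find(right, left_end + 1, window_end)
        if right_pos != -1:
            return left_end, right_pos - left_end
    return None


# Reference = left flank + deleted segment + right flank
reference = "ACGTTGCAAG" + "TTTCCC" + "GGATCAGTCA"
read = "TTGCAAG" + "GGATCAG"  # read from a sample lacking "TTTCCC"
result = find_deletion(reference, read, 0, len(reference))
print(result)  # (10, 6)
```

In this toy case the mapped mate would define the window; here the whole reference is searched for brevity. The reported interval `reference[10:16]` is the deleted segment `TTTCCC`.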
Splitread [31] was developed to specifically identify structural variants and INDELs in WES data from 1 bp to 1 Mbp, building on the split-read approach of Pindel [30]. The algorithms used by SAMtools and GATK limit the size of
4 International Journal of Genomics
Table1
Com
putatio
naltoo
lsDescriptio
nWebsite
References
Alignm
enttools
Burrow
s-Wheeler
Aligner
(BWA)
Perfo
rmshortreads
alignm
entu
singBW
Tapproach
againsta
references
geno
mea
llowingforg
apsmism
atches
httpbio-bw
asou
rceforgenet
[8]
Bowtie
(1amp2)
Perfo
rmssho
rtread
alignm
entu
singtheB
urrows-W
heele
rind
exin
ordertobe
mem
oryeffi
cientwhilestillmaintaining
analignm
ent
speedof
over
25million35
bpreadsp
erho
ur
httpbo
wtie
-bioso
urceforgen
etin
dexshtm
l[910]
ELAND
Shortreadalignerthatachievesspeed
bysplittin
greadsintoequal
leng
thsa
ndapplying
seed
templates
togu
aranteeh
itswith
only2
mism
atches
httpwwwillum
inac
om
Illum
inaInc
GEM
Shortreadaligneru
singstr
ingmatchinginste
adof
BWTto
deliver
precision
andspeed
httpalgorithm
scnagcatw
ikiTh
eGEM
library
[11]
GSN
AP
Perfo
rmssho
rtandlong
read
alignm
entdetectslon
gandshort
distance
splicingSN
Psand
iscapableo
fdetectin
gbisulfite-tr
eated
DNAform
ethylatio
nstu
dies
httpresearch-pub
genec
omgmap
[12]
MAQ
Shortreadalignerc
ompatib
lewith
Illum
ina-Solexa
andABI
SOLiD
dataperform
sung
appedalignm
entallo
wing2-3mism
atches
for
single-endreadsa
ndon
emism
atch
forp
aired-endreads
httpmaqso
urceforgen
et
[13]
mrFAST
Perfo
rmssho
rtread
alignm
entallo
wingforINDEL
supto
8bpfor
Illum
inag
enerated
dataPaired-endmapping
usingao
neend
anchored
algorithm
allowsfor
detectionof
novelinsertio
ns
httpmrfa
stsourceforgen
et
[14]
Novoalign
Alignm
entd
oneo
npaire
d-endor
single-endsequ
encesalso
capable
ofdo
ingmethylationstu
diesA
llowsfor
amism
atch
upto
50of
aread
leng
thandhasb
uilt-in
adaptera
ndbase
quality
trim
ming
httpwwwno
vocraft
comprodu
ctsno
voalign
httpwww
novocraftcom
SOAP(1amp2)
SOAP2
improved
speedby
anordero
fmagnitude
over
SOAP1
and
canalignaw
ider
ange
ofread
leng
thsa
tthe
speedof
2minutes
for
onem
illionsin
gle-endreadsu
singatwo-way
BWTalgorithm
httpsoapgenom
icso
rgcn
[1516]
SSAHA
Usesa
hashingalgorithm
tofin
dexacto
rclose
toexactm
atchingin
DNAandproteindatabasesanalogou
stodo
ingaB
LAST
search
for
each
read
httpsw
wwvectorbaseorgglossary
ssaha-sequ
ence-search-and-alignm
ent-h
ashing
-algorith
m
[17]
Stam
pyAlignm
entd
oneu
singah
ashing
algorithm
andsta
tistic
almod
elto
alignIllum
inar
eads
forg
enom
eRN
Aand
Chip
sequ
encing
allowing
fora
largen
umbero
rvariatio
nsinclu
ding
insertions
anddeletio
ns
httpwwwwelloxacukproject-s
tampy
[18]
International Journal of Genomics 5Ta
ble1Con
tinued
Com
putatio
naltoo
lsDescriptio
nWebsite
References
YOABS
Usesa
0(119899)a
lgorith
mthatuses
both
hash
andtri-b
ased
metho
dsthat
aree
ffectiveinaligning
sequ
enceso
ver2
00bp
with
3tim
esless
mem
oryandtentim
esfaste
rthanSSAHA
Availableb
yrequ
estfor
noncom
mercialuse
[19]
HTS
eqPy
thon
basedpackagew
ithmanyfunctio
nsto
facilitates
everal
aspectso
fsequencingstu
dies
httpwww-hub
erem
bldeHTS
eqdocoverviewhtml
Auxiliary
tools
FastU
niq
Impo
rtssortsandidentifi
esPC
Rdu
plicates
ofshortsequences
from
sequ
encing
data
httpssou
rceforgenetprojectsfastu
niq
[23]
Picard
Picard
isas
etof
commandlin
etoo
lsform
anipulating
high
-throug
hput
sequ
encing
(HTS
)dataa
ndform
atssuchas
SAMBAMCRA
MandVC
Fhttppicardso
urceforgen
et
SAMtools
Suite
oftoolsc
apableof
view
ingindexing
editin
gwritingand
readingSA
MB
AMand
CRAM
form
attedfiles
httpwwwhtsliborg
[7]
SNVandSV
calling
GAT
KVa
riant
calling
ofSN
Psandsm
allINDEL
scanalso
beused
onno
nhum
anandno
ndiploid
organism
shttpsw
wwbroadinstituteo
rggatk
[4ndash6
]
SAMtools
Suite
oftoolsc
apableof
view
ingindexing
editin
gwritingand
readingSA
MB
AMand
CRAM
form
attedfiles
httpwwwhtsliborg
[7]
VCMM
Detectio
nof
SNVsa
ndIN
DEL
susin
gthem
ultin
omialprobabilistic
metho
din
WES
andWGSdata
httpem
usrcr
ikenjpVCM
M
[25]
FreeBa
yes
Detectio
nof
SNPsM
NPsINDEL
sandstr
ucturalvariants(SV
s)fro
msequ
encing
alignm
entsusingBa
yesia
nsta
tisticalmetho
ds
httpsgith
ubco
mekgfreebayes
[27]
indelM
INER
Splitread
algorithm
toidentifybreakp
oint
inIN
DEL
sfrom
paire
d-endsequ
encing
data
httpsgith
ubco
maakroshin
delM
INER
[32]
Pind
elDetectio
nof
INDEL
susin
gap
attern
grow
thapproach
with
anchor
pointsto
providen
ucleotide-levelresolution
httpgm
tgenom
ewustledupackagespindel
[30]
Platypus
Detectio
nof
SNPsM
NPsINDEL
sreplacem
ents
andstructural
varia
nts(SV
s)fro
msequ
encing
alignm
entsusinglocalrealignm
ent
andlocalassem
blyto
achieveh
ighspecificityandsensitivity
httpwwwwelloxacukplatypus
[26]
Splitread
Detectio
nof
INDEL
slessthan50
bplong
from
WES
orWGSdata
usingas
plit-read
algorithm
httpsplitreadso
urceforgen
et
[31]
Sprites
Detectio
nof
INDEL
sisd
oneu
singas
plit-read
andsoft-clipp
ing
approach
thatisespeciallysensitive
indatasetswith
lowcoverage
httpsgith
ubco
mzhang
zhensprites
[33]
VCFannotatio
n
ANNOVA
RProvides
up-to
-datea
nnotationof
VCFfiles
bygeneregion
and
filtersfro
mseveralother
databases
httpanno
varo
penb
ioinform
aticso
rg
[34]
MuT
ect
Postp
rocesses
varia
ntstoeliminatea
rtifa
ctsfrom
hybrid
capture
shortreadalignm
entandnext-generationsequ
encing
httpwwwbroadinstituteo
rgcancercgamutect
[35]
SnpE
ffUses3
800
0geno
mes
topredictand
anno
tatethee
ffectso
fvariants
ongenes
httpsnpeffsourceforgen
et
[36]
SnpSift
ToolstomanipulateV
CFfiles
inclu
ding
filterin
ganno
tatio
ncase
controls
transition
andtransversio
nratesa
ndmore
httpsnpeffsourceforgen
etSnp
Sifthtml
[37]
VAT
Ann
otationof
varia
ntsb
yfunctio
nalityin
acloud
compu
ting
environm
ent
httpvatgerste
inlaborg
[38]
6 International Journal of Genomics
Table1Con
tinued
Com
putatio
naltoo
lsDescriptio
nWebsite
References
Databasefi
ltration
1000
Genom
esProject
Genotypeinformationfro
map
opulationof
1000
healthyindividu
als
httpwww1000geno
mesorg
[41]
dbSN
PDatabaseo
fgenom
icvaria
ntsfrom
53organism
shttpsw
wwncbinlm
nihgovprojectsSN
P[39]
LOVD
Opensource
database
offre
elyavailableg
ene-centered
collectionof
DNAvaria
ntsa
ndsto
rage
ofpatie
ntandNGSdata
httpwwwlovdnl3
0hom
e[40]
COSM
ICDatabasec
ontainingsomaticmutations
from
human
cancers
separatedinto
expertcurateddataandgeno
me-wides
creenpu
blish
edin
scientificliterature
httpcancersa
ngerac
ukcosm
ic[42]
NHLB
IGOEx
ome
Sequ
encing
Project(ES
P)Databaseo
fgenes
andmechanism
sthatcon
tributetobloo
dlung
andheartd
isordersthrou
ghNGSdatain
vario
uspo
pulations
httpevsg
swashing
toneduEV
S
Exom
eAggregatio
nCon
sortium
(ExA
C)Databaseo
f60706un
related
individu
alsfrom
diseasea
ndpo
pulation
exom
esequencingstu
dies
httpexacbroadinstituteorg
[3]
SeattleSeqAnn
otation
Partof
theN
HBL
Isequencingprojectthisdatabase
contains
novel
andkn
ownSN
Vsa
ndIN
DEL
sincluding
accessionnu
mberfunctio
nof
thev
ariantand
HapMap
frequ
enciesclin
icalassociation
and
PolyPh
enpredictio
ns
httpsnpgswashing
toneduSeattleSeqA
nnotation137
Functio
nalpredictors
CADD
Machine
learning
algorithm
toscorea
llpo
ssible86million
substitutions
intheh
uman
referenceg
enom
efrom
1to99
basedon
know
nandsim
ulated
functio
nalvariants
httpcadd
gsw
ashing
toneduinfo
[49]
FATH
MM
UsesH
iddenMarkovMod
elstopredictthe
functio
nalcon
sequ
ences
ofSN
Vsincoding
andno
ncod
ingvaria
ntsthrou
ghaw
ebserver
httpfathmmbiocompu
teorguk
[46]
LRT
Usesthe
Likelih
oodRa
tiosta
tistic
altestto
compare
avariant
tokn
ownvaria
ntsa
nddeterm
ineiftheyarep
redicted
tobe
benign
deleterio
usoru
nkno
wn
httpgeno
mec
shlporgcon
tent19
915
53long
[45]
PolyPh
en-2
Predictspo
tentialimpactof
anon
syno
nymou
svariant
using
comparativ
eand
physicalcharacteris
tics
httpgenetic
sbwhharvardedupp
h2
[44]
SIFT
ByusingPS
I-BL
ASTa
predictio
ncanbe
madeo
nthee
ffectof
ano
nsyn
onym
ousm
utationwith
inap
rotein
httpsift
jcviorg
[43]
VES
TMachine
learning
approach
todeterm
inethe
prob
abilitythata
miss
ense
mutationwill
impairthefun
ctionalityof
aprotein
httpkarchinlaborgapp
sappV
esth
tml
[48]
MetaSVM
ampMetaLR
Integrationof
aSup
portVe
ctor
Machine
andLo
gisticRe
gressio
nto
integraten
ined
eleterious
predictio
nscores
ofmiss
ense
mutations
httpssitesg
ooglec
omsitejp
opgendb
NSFP
[47]
International Journal of Genomics 7
Table1Con
tinued
Com
putatio
naltoo
lsDescriptio
nWebsite
References
Significant
somaticmutations
SomaticSn
iper
Usin
gtwobam
files
asinpu
tthistooluses
theg
enotypelikeliho
odmod
elof
MAZto
calculatethe
prob
abilitythatthetum
orandno
rmal
samples
ared
ifferentthus
identifying
somaticvaria
nts
httpgm
tgenom
ewustledupackagessom
atic-sniper
[50]
MuT
ect
Usin
gsta
tistic
alanalysisto
predictthe
likeliho
odof
asom
atic
mutationusingtwoBa
yesia
napproaches
httpsw
wwbroadinstituteo
rgcancercgamutect
[35]
VarSim
Byleveraging
onpreviouslyrepo
rted
mutationsa
rand
ommutation
simulationispreformed
topredictsom
aticmutations
httpbioinformgith
ubio
varsim
[51]
SomVa
rIUS
Identifi
catio
nof
somaticvaria
ntsfrom
unpaire
dtissues
amples
with
asequ
encing
depthof
150x
and67precision
implem
entedin
Python
httpsgith
ubco
mkylessm
ithSom
VarIUS
[52]
Copy
numbera
lteratio
n
Con
trol-F
REEC
Detectscopy
numberc
hanges
andlossof
heterozygosity(LOH)from
paire
dSA
MBAM
files
bycompu
tingandno
rmalizingcopy
number
andbetaallelefre
quency
httpbioinfo-ou
tcuriefrprojectsfre
ec
[59]
CNV-seq
Mappedread
coun
tisc
alculated
over
aslid
ingwindo
win
PerlandR
todeterm
inec
opynu
mberfrom
HTS
studies
httptig
erdbsnused
usgcnv-seq
[53]
SegSeq
Usin
g14
millionalignedsequ
ence
readsfrom
cancer
celllin
esequ
alcopy
numbera
lteratio
nsarec
alculatedfro
msequ
encing
data
httpsw
wwbroadinstituteo
rgcancercgasegseq
[54]
VarScan2
Determines
copy
numberc
hanges
inmatched
orun
matched
samples
usingread
ratio
sand
then
postp
rocessed
with
acirc
ular
binary
segm
entatio
nalgorithm
httpdk
oboldtgith
ubio
varscanusin
g-varscanhtml
[61]
Exom
eAI
Detectsalleleim
balanceincluding
LOHin
unmatched
tumor
samples
usingas
tatistic
alapproach
thatiscapableo
fhandlinglow-quality
datasets
httpgqinno
vatio
ncentercom
indexaspx
[64]
CNVseeqer
Exon
coverage
betweenmatched
sequ
encesw
ascalculated
usinglog 2
ratio
sfollowed
bythec
ircular
binary
segm
entatio
nalgorithm
httpicbmedco
rnelledu
wikiind
exphp
title=
Elem
entolab
CNVseeqerampredirect=n
o[60]
EXCA
VATO
RDetectscopy
numberv
ariantsfrom
WES
datain
3ste
psusinga
HiddenMarkovMod
elalgorithm
httpssou
rceforgenetprojectsexcavatortoo
l[57]
Exom
eCNV
Rpackageu
sedto
detectcopy
numberv
ariantso
flosso
fheterozygosityfro
mWES
data
httpssecureg
enom
eucla
eduindexph
pEx
omeC
NV
UserGuide
[58]
ADTE
xDetectio
nof
aberratio
nsin
tumor
exom
esby
detectingB-allele
frequ
encies
andim
plem
entedin
Rhttpadtexsourceforgen
et
[55]
CONTR
AUsesn
ormalized
depthof
coverage
todetectcopy
numberc
hanges
from
targeted
resequ
encing
datainclu
ding
WES
httpssou
rceforgenetprojectscontra-cnv
[56]
8 International Journal of Genomics
Table1Con
tinued
Com
putatio
naltoo
lsDescriptio
nWebsite
References
Driv
erpredictiontools
CHASM
Machine
learning
metho
dthatpredictsthefun
ctionalsignificance
ofsomaticmutations
httpkarchinlaborgapp
sappC
hasm
htm
l[65]
Dendrix
Den
ovodriversa
rediscovered
from
cancer
onlymutationald
ata
inclu
ding
genesnu
cleotidesord
omains
thathave
high
exclu
sivity
andcoverage
httpcompb
iocs
browneduprojectsdend
rix
[66]
MutSigC
VGene-specifica
ndpatie
nt-specific
mutationfre
quencies
are
incorporated
tofin
dmutations
ingenesthatare
mutated
moreo
ften
than
wou
ldbe
expected
bychance
httpwwwbroadinstituteo
rgcancersoftw
are
genepatte
rnm
odulesdocsMutSigC
V[67]
Pathwa
yana
lysistoolsa
ndresources
KEGG
Databaseu
singmapso
fkno
wnbiologicalprocessesthatallo
ws
searchingforg
enes
andcolorc
odingof
results
httpwwwgeno
mejpkegg
[68]
DAV
IDAllowsfor
userstoinpu
talarges
etof
genesa
nddiscover
the
functio
nalann
otationof
theg
enelist
inclu
ding
pathwaysgene
ontology
term
sandmore
httpsdavidncifcrfgov
[69]
STRING
Networkvisualizationof
protein-proteininteractions
ofover
2031
organism
shttpstr
ing-db
org
[70]
BERe
XUsesb
iomedicalkn
owledgetoallowuserstosearch
forrelationships
betweenbiom
edicalentities
httpinfosk
oreaac
krberex
[71]
DAPP
LEUsesa
listo
fgenes
todeterm
inep
hysic
alconn
ectiv
ityam
ongproteins
accordingto
protein-proteininteractions
httpjournalsplosorgplosgeneticsarticleid
=101371
journalpgen1001273
[72]
SNPsea
Usesa
linkage
disequ
ilibrium
todeterm
inep
athw
aysa
ndcelltypes
thatarelikely
tobe
affectedbasedon
SNPdata
httpwwwbroadinstituteo
rgm
pgsnp
sea
[73]
Toolsa
ndresourcesfor
linking
varia
ntstotherapeutics
cBioPo
rtal
Databasethatallo
wsthe
downloadanalysis
andvisualizationof
cancer
sequ
encing
studiesincluding
providingpatie
ntandclinical
dataforsam
ples
httpwwwcbiopo
rtalorg
[78]
MyCa
ncer
Genom
eDatabasefor
cancer
research
thatprovides
linkage
ofmutational
status
totherapiesa
ndavailablec
linicaltrials
httpsw
wwmycancergenom
eorg
httpwwwmycan-
cergenom
eorg
ClinVa
rDatabaseo
frelationshipbetweenph
enotypes
andhu
man
varia
tions
show
ingther
elationshipbetweenhealth
statusa
ndhu
man
varia
tions
andkn
ownim
plications
httpsw
wwncbinlm
nihgovclin
var
[74]
DSigD
BDatabaseo
fdrugsig
naturesthatincludes195
31genesa
nd17389
compo
unds
thatcanin
parthelpidentifycompo
unds
ford
rug
repu
rposingstu
dies
intransla
tionalresearch
httptanlabucdenveredu
DSigD
B[77]
PharmGKB
Know
ledgeb
asea
llowingvisualizationof
avarietyof
drug-gene
know
ledge
httpsw
wwph
armgkborg
[75]
DrugB
ank
Con
tainsd
etaileddrug
inform
ationwith
comprehensiv
edrugtarget
inform
ationfor8
206
drugs
httpwwwdrugbank
ca
[76]
International Journal of Genomics 9
Table 1: Continued.
Computational tools | Description | Website | References
WES data analysis pipelines
Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file; it has good capability for novice and expert users | http://fastq2vcf.sourceforge.net | [80]
SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest | [79]
IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81]
Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud |
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL-detecting software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7500 INDELs [32]. SAMtools found the fewest INDELs with 6491, followed by Pindel with 7239 and indelMINER with 7365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER did have the lowest number of false negatives, with 398 compared to 589 and 1181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
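The counts quoted above for the simulated benchmark can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the false-positive percentage is computed over the calls made by each tool and that every simulated INDEL not counted as a false negative is a true positive. The class and property names are ours, not taken from [32].

```python
from dataclasses import dataclass

TOTAL_SIMULATED = 7500  # INDELs in the simulated dataset, as quoted from [32]


@dataclass(frozen=True)
class CallerResult:
    name: str
    called: int           # INDELs reported by the tool
    false_negatives: int  # simulated INDELs the tool missed

    @property
    def true_positives(self) -> int:
        return TOTAL_SIMULATED - self.false_negatives

    @property
    def false_positives(self) -> int:
        return self.called - self.true_positives

    @property
    def false_positive_pct(self) -> float:
        # Assumption: the percentage is taken over the calls made by the tool.
        return 100.0 * self.false_positives / self.called

    @property
    def f_score(self) -> float:
        # Harmonic mean of precision and recall.
        precision = self.true_positives / self.called
        recall = self.true_positives / TOTAL_SIMULATED
        return 2 * precision * recall / (precision + recall)


RESULTS = [
    CallerResult("SAMtools", called=6491, false_negatives=1181),
    CallerResult("Pindel", called=7239, false_negatives=589),
    CallerResult("indelMINER", called=7365, false_negatives=398),
]

for r in RESULTS:
    print(f"{r.name}: FP% = {r.false_positive_pct:.2f}, F-score = {r.f_score:.3f}")
```

Under these assumptions the derived false-positive percentages match the reported ones, and the same counts also give an F-score for each tool, which makes the trade-off between extra calls and missed INDELs easier to compare.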
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim of rapidly annotating millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants, and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases could be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
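As a minimal sketch of MAF-based filtration, the snippet below keeps variants whose population frequency is below a chosen cutoff. The variant identifiers and frequencies are made up for illustration, and the decision to keep variants that are absent from the database (treating them as potentially novel) is an assumption that a given study may or may not want to make.

```python
from typing import Optional

# Toy records: variant IDs and frequencies are invented for illustration.
variants = [
    {"id": "var1", "maf": 0.25},   # common polymorphism
    {"id": "var2", "maf": 0.004},  # rare variant
    {"id": "var3", "maf": None},   # absent from the population database
]


def is_rare(maf: Optional[float], max_maf: float = 0.01) -> bool:
    """Return True for variants below the cutoff or missing from the database."""
    return maf is None or maf < max_maf


def filter_rare(records, max_maf: float = 0.01):
    """Keep only the records that pass the MAF cutoff."""
    return [rec for rec in records if is_rare(rec["maf"], max_maf)]


rare = filter_rare(variants)
print([rec["id"] for rec in rare])
```

The cutoff is a parameter rather than a constant because, as noted above, the appropriate threshold depends on the purpose of the study.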
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict the functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
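Because the predictors rarely agree, it can help to summarize their calls per variant instead of relying on a single tool. The sketch below assumes each predictor's score has already been converted to a boolean "deleterious" call; the example calls are hypothetical, and the simple vote counting shown here is not the MetaSVM or MetaLR method.

```python
def consensus(calls: dict) -> dict:
    """Summarize boolean 'deleterious' calls from several predictors."""
    votes = sum(calls.values())  # True counts as 1, False as 0
    total = len(calls)
    return {
        "deleterious_votes": votes,
        "total": total,
        "unanimous": votes in (0, total),
        "majority_deleterious": votes * 2 > total,
    }


# Hypothetical calls for a single variant.
example_calls = {"SIFT": True, "PolyPhen-2": True, "LRT": False}
summary = consensus(example_calls)
print(summary)
```

Reporting both the vote count and the unanimity flag lets a study choose a strict rule (all predictors agree) or a lenient one (majority), depending on its focus.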
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and hence predicted as somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
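Phred scaling maps a probability p to Q = -10 log10(p), so that smaller probabilities of a wrong call give larger scores. The sketch below shows only this generic transformation. It assumes that p is the probability that the call is wrong (here, that the tumor and normal genotypes are in fact the same), and it does not reproduce SomaticSniper's genotype likelihood model.

```python
import math


def phred_scale(p: float) -> float:
    """Convert a probability in (0, 1] to a Phred-scaled score."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_probability(q: float) -> float:
    """Invert the Phred scale back to a probability."""
    return 10.0 ** (-q / 10.0)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_scale(p), 1))
```

On this scale each additional 10 points corresponds to a tenfold drop in the probability being scaled.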
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, the somatic mutations are not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of 3 steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall rates when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
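Precision and recall against a paired-tissue call set can be computed by treating each call set as a set of variant keys. The following sketch assumes variants are identified by a (chromosome, position, reference, alternate) tuple and that the paired-tissue calls serve as the reference; the example variants are invented.

```python
def precision_recall(test_calls: set, reference_calls: set):
    """Return (precision, recall) of test_calls measured against reference_calls."""
    true_positives = len(test_calls & reference_calls)
    precision = true_positives / len(test_calls) if test_calls else 0.0
    recall = true_positives / len(reference_calls) if reference_calls else 0.0
    return precision, recall


# Invented call sets keyed by (chrom, pos, ref, alt).
tumor_only = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "C", "T"),
}
paired = {
    ("chr1", 100, "A", "T"),
    ("chr2", 40, "C", "T"),
    ("chr3", 7, "G", "A"),
    ("chr5", 9, "T", "C"),
}

precision, recall = precision_recall(tumor_only, paired)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Note that this measures agreement with the paired-tissue calls, which are themselves imperfect, rather than agreement with the true somatic variants.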
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs could be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
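The depth-ratio step and the size-based classification described above can be sketched as follows. This is an illustration under stated assumptions, not VarScan2's implementation: each depth is scaled by its sample's total input (here a total read count, a hypothetical choice of normalizer), and the segmentation step (CBS) is omitted entirely.

```python
import math


def log2_ratio(tumor_depth: float, normal_depth: float,
               tumor_total: float, normal_total: float) -> float:
    """log2 of tumor/normal depth after scaling each by its sample's total input."""
    tumor_norm = tumor_depth / tumor_total
    normal_norm = normal_depth / normal_total
    return math.log2(tumor_norm / normal_norm)


def classify_scna(segment_length: int, arm_length: int,
                  threshold: float = 0.25) -> str:
    """Label an event by the fraction of the chromosome arm it spans."""
    fraction = segment_length / arm_length
    return "large-scale" if fraction > threshold else "focal"


# Doubling of depth in the tumor with equal input gives a ratio of +1.
print(log2_ratio(200, 100, 1_000_000, 1_000_000))
# Equal raw depth but twice the tumor input is a relative loss.
print(log2_ratio(100, 100, 2_000_000, 1_000_000))
print(classify_scna(segment_length=30, arm_length=100))
```

Normalizing before taking the ratio matters because a tumor sample sequenced more deeply than its normal would otherwise appear amplified everywhere.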
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
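The per-site quantity described above is simple to state in code. The sketch below computes the absolute deviation of the B-allele frequency from 0.5 at a single heterozygous site; the read counts are hypothetical, and the segmentation and normal-database filtering steps of ExomeAI are not shown.

```python
def baf_deviation(a_reads: int, b_reads: int) -> float:
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("site has no supporting reads")
    return abs(b_reads / total - 0.5)


# A balanced site and an imbalanced site (hypothetical read counts).
print(baf_deviation(50, 50))
print(baf_deviation(30, 70))
```

Taking the absolute value makes the measure symmetric, so it does not matter which allele is labelled A or B.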
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
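The effect of gene size on chance mutation can be illustrated with a deliberately simplified model. The sketch below assumes a uniform, independent per-base background mutation rate, which is exactly the simplification that MutSigCV is designed to move beyond; the rate and gene lengths are hypothetical round numbers, not measured values.

```python
def prob_at_least_one_mutation(rate_per_base: float, length_bp: int) -> float:
    """P(>= 1 mutation) in a gene under an independent, uniform per-base rate."""
    return 1.0 - (1.0 - rate_per_base) ** length_bp


RATE = 1e-6  # hypothetical background mutation rate per base

for length in (1_000, 10_000, 100_000):
    p = prob_at_least_one_mutation(RATE, length)
    print(f"length {length:>7} bp: P(mutated by chance) = {p:.4f}")
```

Even under this crude model a gene one hundred times longer is roughly one hundred times more likely to carry a passenger mutation, which is why a long gene can appear recurrently mutated across a cohort without being a driver.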
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
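The trade-off between coverage and exclusivity can be made concrete with a small scoring function. The sketch below scores a gene set as its coverage minus its coverage overlap (the number of excess mutations in patients hit more than once), which to our understanding is the weight used in [66]. It then searches by brute force over all gene sets of a fixed size, a substitute for the greedy and MCMC searches that is feasible only for toy inputs. The gene and patient labels are invented.

```python
from itertools import combinations

# Toy data: gene -> set of patients carrying a somatic mutation in it.
mutations = {
    "G1": {"p1", "p2", "p3"},
    "G2": {"p4", "p5"},
    "G3": {"p1", "p2", "p4"},
    "G4": {"p6"},
}


def coverage(gene_set, data) -> int:
    """Number of patients with a mutation in at least one gene of the set."""
    return len(set().union(*(data[g] for g in gene_set)))


def coverage_overlap(gene_set, data) -> int:
    """Mutations beyond the first per patient, summed over the gene set."""
    return sum(len(data[g]) for g in gene_set) - coverage(gene_set, data)


def weight(gene_set, data) -> int:
    """High when many patients are covered and few are covered twice."""
    return coverage(gene_set, data) - coverage_overlap(gene_set, data)


def best_gene_set(data, k: int):
    """Exhaustive search over all gene sets of size k (toy inputs only)."""
    return max(combinations(sorted(data), k), key=lambda gs: weight(gs, data))


print(best_gene_set(mutations, 2), weight(best_gene_set(mutations, 2), mutations))
```

In this toy input G1 and G3 each cover three patients but share two of them, so pairing them is penalized, while G1 and G2 cover five patients with no overlap.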
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
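Functional enrichment analysis is commonly framed as an over-representation test: given a gene list, is a pathway hit more often than expected by chance? The sketch below uses a one-sided hypergeometric test as a generic example. It is not the exact statistic implemented by DAVID, STRING, or BEReX, and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes: int, pathway_genes: int,
                           list_size: int, overlap: int) -> float:
    """P(X >= overlap) when list_size genes are drawn without replacement
    from total_genes, of which pathway_genes belong to the pathway."""
    upper = min(pathway_genes, list_size)
    favourable = sum(
        comb(pathway_genes, i) * comb(total_genes - pathway_genes, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, list_size)


# Hypothetical example: 20000 background genes, a 100-gene pathway,
# and a 50-gene mutated list of which 5 fall in the pathway.
p_value = hypergeom_enrichment_p(20000, 100, 50, 5)
print(f"p = {p_value:.3e}")
```

In practice many pathways are tested at once, so the resulting p-values would normally be corrected for multiple testing before interpretation.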
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
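At its simplest, linking variants to drugs is a lookup of mutated genes against drug-target sets. The sketch below ranks drugs by the number of their targets that are mutated in a sample. The drug and gene names are placeholders, and the in-memory dictionary stands in for a real resource such as DSigDB, DrugBank, or PharmGKB, whose actual file formats and interfaces are not modelled here.

```python
# Placeholder drug -> target-gene sets (not taken from any real database).
DRUG_TARGETS = {
    "drug_A": {"GENE1", "GENE2"},
    "drug_B": {"GENE3"},
    "drug_C": {"GENE2", "GENE4", "GENE5"},
}


def rank_drugs(mutated_genes: set, drug_targets: dict) -> list:
    """Return (drug, matched targets) pairs, most matched targets first."""
    hits = {
        drug: sorted(targets & mutated_genes)
        for drug, targets in drug_targets.items()
    }
    ranked = [(drug, genes) for drug, genes in hits.items() if genes]
    # Sort by descending number of matches, then by drug name for stability.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


sample_mutations = {"GENE2", "GENE4", "GENE9"}
for drug, genes in rank_drugs(sample_mutations, DRUG_TARGETS):
    print(drug, genes)
```

A count of matched targets is only a starting point for prioritization; it says nothing about whether a given variant activates or inactivates its gene, which determines whether a drug against that target is relevant.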
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate the computational tools and methods described in the previous sections into a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options as compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs to select the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods were developed to generate VCF files from raw sequencing data, as well as to perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “dbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world's knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “Somaticsniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
[Figure 2 (workflow diagram): whole exome sequencing data (fastq) -> alignment -> auxiliary tools -> SNV & SV calling -> VCF -> VCF annotation -> database filtration and functional prediction -> somatic mutations, driver prediction, copy number alterations, and pathway analysis -> therapeutic prediction, biological insights, and precision medicine. The steps up to the VCF file are labeled pre-VCF analyses. Example tools named in the diagram include BWA and Bowtie (alignment), Picard and SAMtools (auxiliary), GATK and FreeBayes (SNV & SV calling), ANNOVAR and SnpEff (VCF annotation), and SeqMule, Fastq2vcf, IMPACT, and Genomes on the Cloud (full pipelines).]

Figure 2: Whole Exome Sequencing data analysis steps. Novel computational methods and tools have been developed to analyze the full spectrum of WES data, translating raw fastq files to biological insights and precision medicine.
programs are GATK [4–6], SAMtools [7], and VCMM [25]. The actual SNV calling mechanisms of GATK and SAMtools are very similar; the tools differ mainly in what happens before and after SNV calling. GATK assumes that each sequencing error is independent, while SAMtools assumes that a secondary error carries more weight. After SNV calling, GATK learns from the data, while SAMtools relies on options set by the user. Variant Caller with Multinomial probabilistic Model (VCMM) is another tool developed to detect SNVs and INDELs from WES and Whole Genome Sequencing (WGS) studies, using a multinomial probabilistic model with quality score and a strand bias filter [25]. VCMM suppressed false-positive and false-negative variant calls when compared to GATK and SAMtools. However, the number of variant calls was similar to previous studies. The comparison done by the authors of VCMM demonstrated that, while all three methods call a large number of common SNVs, each tool also identifies SNVs not found by the other methods [25]. The ability of each method to call SNVs not found by the others should be taken into account when choosing SNV calling tool(s).
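The overlap between callers described above can be inspected with simple set operations once each tool's calls have been reduced to comparable keys. The following is a minimal sketch, assuming each call is represented as a (chromosome, position, reference allele, alternate allele) tuple; the function name `partition_calls` and the toy call sets are hypothetical and are not output from GATK, SAMtools, or VCMM.

```python
def partition_calls(calls_by_tool):
    """Split variant calls into those shared by every tool and those unique to one.

    calls_by_tool maps a tool name to a set of (chrom, pos, ref, alt) tuples.
    """
    shared = set.intersection(*calls_by_tool.values())
    unique = {}
    for tool, calls in calls_by_tool.items():
        # Union of everything called by the other tools.
        others = set().union(*(c for t, c in calls_by_tool.items() if t != tool))
        unique[tool] = calls - others
    return shared, unique


# Toy call sets, invented for illustration only.
calls = {
    "GATK": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")},
    "SAMtools": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr3", 7, "T", "C")},
    "VCMM": {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chrX", 9, "A", "T")},
}
shared, unique = partition_calls(calls)
```

In practice the tuples would come from parsed and normalized VCF records; calls unique to a single tool are the ones that most need validation before they are trusted.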
2.4. Methods for Structural Variants (SVs) Identification. Structural Variants (SVs) such as insertions and deletions (INDELs) in high-throughput sequencing data are more challenging to identify than single nucleotide variants because they can involve an undefined number of nucleotides. The majority of WES studies follow the SAMtools [7] or GATK [4–6] workflows, which will identify INDELs in the data. However, other software has been developed to increase the sensitivity of INDEL discovery while simultaneously decreasing the false discovery rate.
Platypus [26] was developed to find SNVs, INDELs, and complex polymorphisms using local de novo assembly. When compared to SAMtools and GATK, Platypus had the lowest Fosmid false discovery rate for both SNVs and INDELs in whole genome sequencing of 15 samples. It also had the shortest runtime of these tools. However, GATK and SAMtools had lower Fosmid false discovery rates than Platypus when finding SNVs and INDELs in WES data [26]. Platypus therefore seems appropriate for whole genome sequencing, but caution should be used when applying it to WES data.
FreeBayes takes a different approach to INDEL detection than the other tools: it uses haplotype-based variant detection under a Bayesian statistics framework [27]. This method has been used in several studies in combination with other approaches to identify unique INDELs [28, 29].
Pindel was one of the first programs developed to address the problem of large INDELs going undetected because of the short length of WGS reads [30]. In brief, after alignment of the reads to the reference genome, Pindel identifies read pairs where one end was mapped and the other was not [30]. Pindel then searches the reference genome for the unmapped portion of the read over a user-defined area of the genome [30]. This split-read algorithm successfully identified large INDELs. Other computational tools developed after Pindel still use this algorithm as the foundation of their methods for detecting INDELs.
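The split-read idea can be illustrated on plain strings. The sketch below is a deliberately simplified toy, not Pindel's pattern growth implementation: it assumes a single read whose start is anchored at a known reference position, exact matching with no sequencing errors, and a deletion only. The function name `find_deletion` and its parameters are hypothetical.

```python
def find_deletion(reference, read, anchor_pos, max_window):
    """Toy split-read search for a single deletion.

    The read is anchored at anchor_pos. The anchored prefix is extended while it
    matches the reference; the remaining suffix is then searched for downstream,
    allowing a deletion of at most max_window bases.
    Returns (breakpoint, deletion_length) or None.
    """
    k = 0
    while (k < len(read) and anchor_pos + k < len(reference)
           and read[k] == reference[anchor_pos + k]):
        k += 1
    if k == 0 or k == len(read):
        return None  # nothing anchored, or the read matches without a split
    suffix = read[k:]
    start = anchor_pos + k
    # Restrict the search to the user-defined window downstream of the breakpoint.
    hit = reference.find(suffix, start, start + max_window + len(suffix))
    if hit == -1:
        return None
    return start, hit - start


# Toy example: "CCC" is deleted from the sample relative to the reference.
reference = "AAACCCGGGTTT"
read = "AAAGGGTTT"
result = find_deletion(reference, read, anchor_pos=0, max_window=10)
```

When the deleted sequence shares bases with its flanks, the reported breakpoint may be shifted to an equivalent position; real tools handle this, along with mismatches and insertions, far more carefully.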
Splitread [31] was developed to specifically identify structural variants and INDELs in WES data from 1 bp to 1 Mbp, building on the split-read approach of Pindel [30]. The algorithms used by SAMtools and GATK limit the size of structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL detection software in WES analysis.

Table 1

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| **Alignment tools** | | | |
| Burrows-Wheeler Aligner (BWA) | Performs short reads alignment using BWT approach against a reference genome, allowing for gaps/mismatches | http://bio-bwa.sourceforge.net | [8] |
| Bowtie (1 & 2) | Performs short read alignment using the Burrows-Wheeler index in order to be memory efficient while still maintaining an alignment speed of over 25 million 35 bp reads per hour | http://bowtie-bio.sourceforge.net/index.shtml | [9, 10] |
| ELAND | Short read aligner that achieves speed by splitting reads into equal lengths and applying seed templates to guarantee hits with only 2 mismatches | http://www.illumina.com | Illumina, Inc. |
| GEM | Short read aligner using string matching instead of BWT to deliver precision and speed | http://algorithms.cnag.cat/wiki/The_GEM_library | [11] |
| GSNAP | Performs short and long read alignment, detects long and short distance splicing and SNPs, and is capable of detecting bisulfite-treated DNA for methylation studies | http://research-pub.gene.com/gmap | [12] |
| MAQ | Short read aligner compatible with Illumina-Solexa and ABI SOLiD data; performs ungapped alignment allowing 2-3 mismatches for single-end reads and one mismatch for paired-end reads | http://maq.sourceforge.net | [13] |
| mrFAST | Performs short read alignment allowing for INDELs up to 8 bp for Illumina generated data. Paired-end mapping using a one end anchored algorithm allows for detection of novel insertions | http://mrfast.sourceforge.net | [14] |
| Novoalign | Alignment done on paired-end or single-end sequences; also capable of doing methylation studies. Allows for a mismatch up to 50% of a read length and has built-in adapter and base quality trimming | http://www.novocraft.com/products/novoalign | http://www.novocraft.com |
| SOAP (1 & 2) | SOAP2 improved speed by an order of magnitude over SOAP1 and can align a wider range of read lengths at the speed of 2 minutes for one million single-end reads using a two-way BWT algorithm | http://soap.genomics.org.cn | [15, 16] |
| SSAHA | Uses a hashing algorithm to find exact or close to exact matching in DNA and protein databases, analogous to doing a BLAST search for each read | https://www.vectorbase.org/glossary/ssaha-sequence-search-and-alignment-hashing-algorithm | [17] |
| Stampy | Alignment done using a hashing algorithm and statistical model to align Illumina reads for genome, RNA, and ChIP sequencing, allowing for a large number of variations including insertions and deletions | http://www.well.ox.ac.uk/project-stampy | [18] |
| YOABS | Uses an O(n) algorithm that combines hash and trie-based methods and is effective in aligning sequences over 200 bp with 3 times less memory and ten times faster than SSAHA | Available by request for noncommercial use | [19] |
| HTSeq | Python based package with many functions to facilitate several aspects of sequencing studies | http://www-huber.embl.de/HTSeq/doc/overview.html | |
| **Auxiliary tools** | | | |
| FastUniq | Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data | https://sourceforge.net/projects/fastuniq | [23] |
| Picard | Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF | http://picard.sourceforge.net | |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| **SNV and SV calling** | | | |
| GATK | Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms | https://www.broadinstitute.org/gatk | [4–6] |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| VCMM | Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data | http://emu.src.riken.jp/VCMM | [25] |
| FreeBayes | Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods | https://github.com/ekg/freebayes | [27] |
| indelMINER | Split-read algorithm to identify breakpoints in INDELs from paired-end sequencing data | https://github.com/aakrosh/indelMINER | [32] |
| Pindel | Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution | http://gmt.genome.wustl.edu/packages/pindel | [30] |
| Platypus | Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity | http://www.well.ox.ac.uk/platypus | [26] |
| Splitread | Detection of INDELs less than 50 bp long from WES or WGS data using a split-read algorithm | http://splitread.sourceforge.net | [31] |
| Sprites | Detection of INDELs using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage | https://github.com/zhangzhen/sprites | [33] |
| **VCF annotation** | | | |
| ANNOVAR | Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases | http://annovar.openbioinformatics.org | [34] |
| MuTect | Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing | http://www.broadinstitute.org/cancer/cga/mutect | [35] |
| SnpEff | Uses 38,000 genomes to predict and annotate the effects of variants on genes | http://snpeff.sourceforge.net | [36] |
| SnpSift | Tools to manipulate VCF files including filtering, annotation, case controls, transition and transversion rates, and more | http://snpeff.sourceforge.net/SnpSift.html | [37] |
| VAT | Annotation of variants by functionality in a cloud computing environment | http://vat.gersteinlab.org | [38] |
| **Database filtration** | | | |
| 1000 Genomes Project | Genotype information from a population of 1000 healthy individuals | http://www.1000genomes.org | [41] |
| dbSNP | Database of genomic variants from 53 organisms | https://www.ncbi.nlm.nih.gov/projects/SNP | [39] |
| LOVD | Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data | http://www.lovd.nl/3.0/home | [40] |
| COSMIC | Database containing somatic mutations from human cancers, separated into expert curated data and genome-wide screens published in scientific literature | http://cancer.sanger.ac.uk/cosmic | [42] |
| NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations | http://evs.gs.washington.edu/EVS | |
| Exome Aggregation Consortium (ExAC) | Database of 60,706 unrelated individuals from disease and population exome sequencing studies | http://exac.broadinstitute.org | [3] |
| SeattleSeq Annotation | Part of the NHLBI sequencing project, this database contains novel and known SNVs and INDELs including accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions | http://snp.gs.washington.edu/SeattleSeqAnnotation137 | |
| **Functional predictors** | | | |
| CADD | Machine learning algorithm to score all possible 8.6 billion substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants | http://cadd.gs.washington.edu/info | [49] |
| FATHMM | Uses Hidden Markov Models to predict the functional consequences of SNVs in coding and noncoding variants through a web server | http://fathmm.biocompute.org.uk | [46] |
| LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if they are predicted to be benign, deleterious, or unknown | http://genome.cshlp.org/content/19/9/1553.long | [45] |
| PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics | http://genetics.bwh.harvard.edu/pph2 | [44] |
| SIFT | By using PSI-BLAST, a prediction can be made on the effect of a nonsynonymous mutation within a protein | http://sift.jcvi.org | [43] |
| VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein | http://karchinlab.org/apps/appVest.html | [48] |
| MetaSVM & MetaLR | Integration of a Support Vector Machine and Logistic Regression to integrate nine deleterious prediction scores of missense mutations | https://sites.google.com/site/jpopgen/dbNSFP | [47] |
| **Significant somatic mutations** | | | |
| SomaticSniper | Using two bam files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper | [50] |
| MuTect | Uses statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35] |
| VarSim | By leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim | [51] |
| SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52] |
| **Copy number alteration** | | | |
| Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec | [59] |
| CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq | [53] |
| SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54] |
| VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61] |
| ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64] |
| CNVseeqer | Exon coverage between matched sequences is calculated using log2 ratios, followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60] |
| EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool | [57] |
| ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58] |
| ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies; implemented in R | http://adtex.sourceforge.net | [55] |
| CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv | [56] |
| **Driver prediction tools** | | | |
| CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65] |
| Dendrix | De novo drivers are discovered from cancer-only mutational data, including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix | [66] |
| MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67] |
| **Pathway analysis tools and resources** | | | |
| KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg | [68] |
| DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more | https://david.ncifcrf.gov | [69] |
| STRING | Network visualization of protein-protein interactions of over 2031 organisms | http://string-db.org | [70] |
| BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex | [71] |
| DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72] |
| SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea | [73] |
| **Tools and resources for linking variants to therapeutics** | | | |
| cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies, including providing patient and clinical data for samples | http://www.cbioportal.org | [78] |
| My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org | http://www.mycancergenome.org |
| ClinVar | Database of relationships between phenotypes and human variations, showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar | [74] |
| DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77] |
| PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org | [75] |
| DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca | [76] |
| **WES data analysis pipelines** | | | |
| Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file; suitable for both novice and expert users | http://fastq2vcf.sourceforge.net | [80] |
| SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest | [79] |
| IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81] |
| Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud | |
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. SAMtools, Pindel, and indelMINER were compared on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs (6,491), followed by Pindel (7,239) and indelMINER (7,365). indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). indelMINER also had the lowest number of false negatives, with 398 compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. It is therefore likely that future tools for SV detection will take an approach similar to indelMINER and try to incorporate the best methods developed thus far.
Most of the recent SV detection tools rely on realigning split reads to detect deletions. Rather than taking a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel on both real and simulated data [33].
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each has strengths and weaknesses in detecting SVs. It is therefore suggested to use several of these tools in combination to detect SVs in WES.
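The two ideas in this paragraph, longest-prefix matching of a soft-clipped segment and evaluation by F-score, can be sketched on toy strings. This is an illustrative simplification under stated assumptions (exact matching, brute-force search, hypothetical function names), not the Sprites implementation. The suffix case can be handled the same way by reversing both strings.

```python
def longest_prefix_match(clipped, target):
    """Return (length, position) of the longest prefix of `clipped` found in `target`.

    Brute-force search from the longest prefix downwards; returns (0, -1)
    when not even the first base occurs in the target.
    """
    for length in range(len(clipped), 0, -1):
        pos = target.find(clipped[:length])
        if pos != -1:
            return length, pos
    return 0, -1


def f_score(tp, fp, fn):
    """F-score (harmonic mean of precision and recall) from raw counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy soft-clipped segment: the first 7 bases match the target, the rest do not.
match = longest_prefix_match("GATTACANN", "CCGATTACACC")
score = f_score(tp=8, fp=2, fn=2)
```

The F-score rewards a caller only when both precision and recall are high, which is why it is a convenient single number for comparing deletion callers.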
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim to rapidly annotate millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38].
VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases could be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
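A frequency-based filter of this kind can be sketched in a few lines of Python. The record layout (a dict with a `maf` key), the variant identifiers, and the choice to keep variants that are absent from the population database are all assumptions made for illustration; the default cutoff corresponds to the rare-variant threshold mentioned above for cancer studies.

```python
def filter_rare_variants(variants, max_maf=0.01, keep_missing=True):
    """Keep variants whose minor allele frequency is below `max_maf`.

    `variants` is an iterable of dicts with a "maf" key (hypothetical
    layout). A MAF of None means the variant was not found in the
    population database; such variants are kept when `keep_missing`
    is True, since they may be novel.
    """
    rare = []
    for variant in variants:
        maf = variant.get("maf")
        if maf is None:
            if keep_missing:
                rare.append(variant)
        elif maf < max_maf:
            rare.append(variant)
    return rare


# Made-up records for illustration only.
example = [
    {"id": "var_common", "maf": 0.23},
    {"id": "var_rare", "maf": 0.002},
    {"id": "var_novel", "maf": None},
]
print([v["id"] for v in filter_rare_variants(example)])
```

A study interested in common SNVs would invert the comparison or raise `max_maf`, which is why the threshold is left as a parameter.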
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors, including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
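The logistic-regression flavor of such an ensemble can be illustrated as a weighted sum of component scores passed through a sigmoid. The predictor names, weights, intercept, and score values below are made up for illustration and are not the coefficients of MetaLR [47]; real component scores would also need to be placed on comparable scales before being combined.

```python
import math


def logistic_ensemble_score(scores, weights, intercept=0.0):
    """Combine component scores into one value in (0, 1).

    `scores` and `weights` are dicts keyed by predictor name. Every
    predictor that has a weight must also have a score.
    """
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise KeyError(f"missing component scores: {missing}")
    z = intercept + sum(weights[name] * scores[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and scores, not taken from any trained model.
toy_weights = {"predictor_a": 1.5, "predictor_b": 0.8, "predictor_c": 2.0}
toy_scores = {"predictor_a": 0.9, "predictor_b": 0.2, "predictor_c": 0.7}
print(round(logistic_ensemble_score(toy_scores, toy_weights, intercept=-2.0), 3))
```

In a trained model the weights would be estimated from labeled deleterious and benign variants rather than chosen by hand.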
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
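Because predictors so often disagree, it can help to tabulate how many of them flag each variant before interpreting the calls. The sketch below uses hypothetical variant identifiers and made-up boolean calls; it reports the fraction of variants flagged by at least one predictor that are flagged by all of them.

```python
def fraction_unanimous(calls_by_variant):
    """Fraction of flagged variants that every predictor calls deleterious.

    `calls_by_variant` maps a variant id to a dict of
    predictor name -> bool (True means "predicted deleterious").
    Variants flagged by no predictor are ignored.
    """
    flagged = [calls for calls in calls_by_variant.values() if any(calls.values())]
    if not flagged:
        return 0.0
    unanimous = sum(1 for calls in flagged if all(calls.values()))
    return unanimous / len(flagged)


# Made-up calls for illustration only.
toy_calls = {
    "var1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "var2": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "var3": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
    "var4": {"SIFT": False, "PolyPhen-2": False, "LRT": False},
}
print(fraction_unanimous(toy_calls))
```

Reporting the vote count alongside each variant, rather than a single yes or no, keeps the disagreement visible to whoever interprets the results.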
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
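Phred scaling converts a small probability into a more readable integer score. The sketch below assumes the usual Phred convention, Q = -10 log10(p), where p is taken to be the probability that the call is wrong, that is, that the tumor and normal genotypes are in fact the same. Both this reading and the cap of 255 are assumptions for illustration rather than details confirmed by the text.

```python
import math


def phred_scaled_score(p_same_genotype, cap=255):
    """Phred-scale a probability: Q = -10 * log10(p), rounded and capped.

    `p_same_genotype` is assumed to be the probability that tumor and
    normal genotypes are identical, so a higher score means stronger
    evidence for a somatic difference.
    """
    if not 0.0 <= p_same_genotype <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_same_genotype == 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_same_genotype))))


for p in (0.5, 0.01, 0.001):
    print(p, phred_scaled_score(p))
```

On this scale, each additional 10 points corresponds to a tenfold drop in the underlying probability.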
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, somatic mutations are not only identified but also reduced in number to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
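Precision and recall figures of this kind come from comparing tumor-only calls against a reference call set, here the paired-tissue calls. A minimal sketch follows; the variant keys are hypothetical and each variant is represented simply as a hashable identifier.

```python
def precision_recall(called, reference):
    """Precision and recall of `called` variants against `reference`.

    precision = true positives / all called variants
    recall    = true positives / all reference variants
    """
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical variant keys of the form "chrom:pos:ref>alt".
tumor_only_calls = {"1:100:A>T", "1:250:G>C", "2:40:C>T", "3:7:T>G"}
paired_calls = {"1:250:G>C", "2:40:C>T", "5:900:A>G"}
print(precision_recall(tumor_only_calls, paired_calls))
```

Note that the paired-tissue calls serve as the reference here but are themselves predictions, so both numbers are relative to that call set rather than to a ground truth.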
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs could be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
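The depth-ratio step and the size-based classification can be sketched as follows. The normalization shown here (dividing each region's depth by the total amount of data in its sample) is a simplified stand-in for the normalization described above, the function and parameter names are hypothetical, and the handling of a segment that covers exactly 25% of an arm is our own choice since the text leaves it open. Segmentation by CBS is not shown.

```python
import math


def log2_depth_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2 of tumor/normal depth after scaling by each sample's total data."""
    if min(tumor_depth, normal_depth, tumor_total, normal_total) <= 0:
        raise ValueError("depths and totals must be positive")
    scaled_tumor = tumor_depth / tumor_total
    scaled_normal = normal_depth / normal_total
    return math.log2(scaled_tumor / scaled_normal)


def classify_scna(segment_length, arm_length, threshold=0.25):
    """Label a segment large-scale if it spans more than 25% of the arm."""
    fraction = segment_length / arm_length
    return "large-scale" if fraction > threshold else "focal"


# Equal input data, tumor depth doubled: a log2 ratio close to 1 (gain).
print(log2_depth_ratio(200, 100, 1_000_000, 1_000_000))
# Same raw depth, but the tumor sample has twice the data: close to -1 (loss).
print(log2_depth_ratio(100, 100, 2_000_000, 1_000_000))
print(classify_scna(30_000_000, 100_000_000))
```

The second call shows why normalization matters: identical raw depths can still indicate a loss once the difference in sequencing volume is accounted for.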
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
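The per-site quantity described here is simple to compute from allele read counts. In this sketch the B-allele frequency is taken as B-allele reads divided by total reads at a heterozygous site, which is the usual definition but is stated here as an assumption; the read counts are made up, and the segmentation and normal-sample filtering steps are not shown.

```python
def baf_deviation(a_reads, b_reads):
    """Absolute deviation of the B-allele frequency from 0.5.

    A balanced heterozygous site (1:1 ratio) gives 0.0; complete loss
    of one allele gives 0.5.
    """
    total = a_reads + b_reads
    if total <= 0:
        raise ValueError("site has no reads")
    baf = b_reads / total
    return abs(baf - 0.5)


# Made-up (A reads, B reads) pairs at heterozygous sites.
for a, b in [(50, 50), (25, 75), (100, 0)]:
    print(a, b, baf_deviation(a, b))
```

Taking the absolute value makes the measure symmetric, so it does not matter which allele is labeled A or B.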
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
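The effect of gene length on chance mutation can be made concrete with a deliberately naive model. The sketch below assumes a single uniform per-base mutation rate and independence between bases and patients, which is exactly the kind of simplification MutSigCV is designed to move beyond; the rate and the two gene lengths are illustrative values, not measurements.

```python
def prob_mutated_by_chance(gene_length_bp, per_base_rate, n_patients=1):
    """P(at least one mutation in the gene across the cohort).

    Naive model: every base in every patient mutates independently
    with the same probability `per_base_rate`.
    """
    if gene_length_bp < 0 or n_patients < 0:
        raise ValueError("length and cohort size must be non-negative")
    if not 0.0 <= per_base_rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return 1.0 - (1.0 - per_base_rate) ** (gene_length_bp * n_patients)


RATE = 1e-6  # illustrative per-base rate, not an estimate
for label, length in [("long gene", 100_000), ("short gene", 1_500)]:
    p = prob_mutated_by_chance(length, RATE, n_patients=100)
    print(f"{label}: {p:.4f}")
```

Under this model a long gene accumulates chance mutations far more readily than a short one as the cohort grows, which is why recurrence alone is weak evidence that a large gene is a driver.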
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to measure sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
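Coverage and exclusivity can be combined into a single score for a candidate gene set. The sketch below scores a set as coverage minus coverage overlap, where coverage is the number of patients with a mutation in at least one gene of the set and overlap counts the extra mutations in patients hit more than once. This follows our reading of the weight used in Dendrix [66] and should be treated as an assumption; the gene names and patient identifiers are hypothetical, and the greedy and MCMC search procedures are not shown.

```python
def covered_patients(gene_set, mutations):
    """Patients with a mutation in at least one gene of `gene_set`."""
    covered = set()
    for gene in gene_set:
        covered |= mutations.get(gene, set())
    return covered


def exclusivity_weight(gene_set, mutations):
    """Coverage minus coverage overlap for a gene set.

    `mutations` maps gene name -> set of patient ids carrying a
    mutation in that gene. A perfectly exclusive set scores its full
    coverage; co-occurring mutations lower the score.
    """
    coverage = len(covered_patients(gene_set, mutations))
    total_mutations = sum(len(mutations.get(gene, set())) for gene in gene_set)
    overlap = total_mutations - coverage
    return coverage - overlap


# Hypothetical genes and patients.
toy_mutations = {
    "GENE_A": {1, 2, 3},
    "GENE_B": {4, 5},
    "GENE_C": {1, 2, 4},
}
print(exclusivity_weight({"GENE_A", "GENE_B"}, toy_mutations))
print(exclusivity_weight({"GENE_A", "GENE_C"}, toy_mutations))
```

In the toy data, GENE_A and GENE_B never co-occur and together cover five patients, so they score higher than GENE_A with GENE_C, which share two patients.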
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
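Functional enrichment analysis on a user-defined gene list typically asks whether a pathway is over-represented among the listed genes. A common way to test this is a one-sided hypergeometric tail probability, sketched below with made-up counts. This is a generic illustration and not the specific statistic implemented by any of the tools above, some of which use modified versions of this test.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_query, n_overlap):
    """P(X >= n_overlap) under the hypergeometric distribution.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_query:      genes in the user-defined list
    n_overlap:    query genes that fall in the pathway
    """
    upper = min(n_pathway, n_query)
    denominator = comb(n_background, n_query)
    numerator = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return numerator / denominator


# Made-up counts: 20000 background genes, 150 in the pathway,
# 200 mutated genes in the query list, 8 of them in the pathway.
print(hypergeom_enrichment_p(20000, 150, 200, 8))
```

When many pathways are tested at once, the resulting p-values would still need a multiple-testing correction, which is omitted here.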
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
These three WES pipelines differ in their strengths and weaknesses. SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods were developed to generate VCF files from raw sequencing data as well as to perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 11.10, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "dbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "SomaticSniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014
[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015
[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009
[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009
[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014
[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012
[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013
[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011
[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011
[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014
[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012
[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011
[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015
[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014
[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010
[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., "Mutational heterogeneity in cancer and the search for new cancer-associated genes," Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, "BEReX: biomedical entity-relationship eXplorer," Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., "Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology," PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci," Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., "Pharmacogenomics knowledge for personalized medicine," Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., "DrugBank 4.0: shedding new light on drug metabolism," Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., "DSigDB: drug signatures database for gene set analysis," Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., "The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data," Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, "SeqMule: automated pipeline for analysis of human exome/genome sequencing data," Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, "Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses," BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., "IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples," Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., "A research roadmap for next-generation sequencing informatics," Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1
Computational tools | Description | Website | References

Alignment tools
Burrows-Wheeler Aligner (BWA) | Performs short read alignment using the BWT approach against a reference genome, allowing for gaps/mismatches | http://bio-bwa.sourceforge.net/ | [8]
Bowtie (1 & 2) | Performs short read alignment using the Burrows-Wheeler index in order to be memory efficient while still maintaining an alignment speed of over 25 million 35 bp reads per hour | http://bowtie-bio.sourceforge.net/index.shtml | [9, 10]
ELAND | Short read aligner that achieves speed by splitting reads into equal lengths and applying seed templates to guarantee hits with only 2 mismatches | http://www.illumina.com/ | Illumina, Inc.
GEM | Short read aligner using string matching instead of BWT to deliver precision and speed | http://algorithms.cnag.cat/wiki/The_GEM_library | [11]
GSNAP | Performs short and long read alignment, detects long and short distance splicing and SNPs, and is capable of detecting bisulfite-treated DNA for methylation studies | http://research-pub.gene.com/gmap/ | [12]
MAQ | Short read aligner compatible with Illumina-Solexa and ABI SOLiD data; performs ungapped alignment allowing 2-3 mismatches for single-end reads and one mismatch for paired-end reads | http://maq.sourceforge.net/ | [13]
mrFAST | Performs short read alignment allowing for INDELs up to 8 bp for Illumina generated data. Paired-end mapping using a one-end-anchored algorithm allows for detection of novel insertions | http://mrfast.sourceforge.net/ | [14]
Novoalign | Alignment done on paired-end or single-end sequences; also capable of doing methylation studies. Allows for a mismatch up to 50% of a read length and has built-in adapter and base quality trimming | http://www.novocraft.com/products/novoalign/ | http://www.novocraft.com
SOAP (1 & 2) | SOAP2 improved speed by an order of magnitude over SOAP1 and can align a wide range of read lengths at the speed of 2 minutes for one million single-end reads using a two-way BWT algorithm | http://soap.genomics.org.cn/ | [15, 16]
SSAHA | Uses a hashing algorithm to find exact or close to exact matching in DNA and protein databases, analogous to doing a BLAST search for each read | https://www.vectorbase.org/glossary/ssaha-sequence-search-and-alignment-hashing-algorithm | [17]
Stampy | Alignment done using a hashing algorithm and statistical model to align Illumina reads for genome, RNA, and ChIP sequencing, allowing for a large number of variations including insertions and deletions | http://www.well.ox.ac.uk/project-stampy | [18]
YOABS | Uses an O(n) algorithm that uses both hash and trie-based methods that are effective in aligning sequences over 200 bp with 3 times less memory and ten times faster than SSAHA | Available by request for noncommercial use | [19]
HTSeq | Python based package with many functions to facilitate several aspects of sequencing studies | http://www-huber.embl.de/HTSeq/doc/overview.html |

Auxiliary tools
FastUniq | Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data | https://sourceforge.net/projects/fastuniq/ | [23]
Picard | Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF | http://picard.sourceforge.net/ |
SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org/ | [7]

SNV and SV calling
GATK | Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms | https://www.broadinstitute.org/gatk/ | [4–6]
SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org/ | [7]
VCMM | Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data | http://emu.src.riken.jp/VCMM/ | [25]
FreeBayes | Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods | https://github.com/ekg/freebayes | [27]
indelMINER | Split-read algorithm to identify breakpoints in INDELs from paired-end sequencing data | https://github.com/aakrosh/indelMINER | [32]
Pindel | Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution | http://gmt.genome.wustl.edu/packages/pindel/ | [30]
Platypus | Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity | http://www.well.ox.ac.uk/platypus | [26]
Splitread | Detection of INDELs less than 50 bp long from WES or WGS data using a split-read algorithm | http://splitread.sourceforge.net/ | [31]
Sprites | Detection of INDELs using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage | https://github.com/zhangzhen/sprites | [33]

VCF annotation
ANNOVAR | Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases | http://annovar.openbioinformatics.org/ | [34]
MuTect | Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing | http://www.broadinstitute.org/cancer/cga/mutect | [35]
SnpEff | Uses 38,000 genomes to predict and annotate the effects of variants on genes | http://snpeff.sourceforge.net/ | [36]
SnpSift | Tools to manipulate VCF files including filtering, annotation, case controls, transition and transversion rates, and more | http://snpeff.sourceforge.net/SnpSift.html | [37]
VAT | Annotation of variants by functionality in a cloud computing environment | http://vat.gersteinlab.org/ | [38]

Database filtration
1000 Genomes Project | Genotype information from a population of 1000 healthy individuals | http://www.1000genomes.org/ | [41]
dbSNP | Database of genomic variants from 53 organisms | https://www.ncbi.nlm.nih.gov/projects/SNP/ | [39]
LOVD | Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data | http://www.lovd.nl/3.0/home | [40]
COSMIC | Database containing somatic mutations from human cancers, separated into expert curated data and genome-wide screens published in scientific literature | http://cancer.sanger.ac.uk/cosmic | [42]
NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations | http://evs.gs.washington.edu/EVS/ |
Exome Aggregation Consortium (ExAC) | Database of 60,706 unrelated individuals from disease and population exome sequencing studies | http://exac.broadinstitute.org/ | [3]
SeattleSeq Annotation | Part of the NHLBI sequencing project, this database contains novel and known SNVs and INDELs, including accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions | http://snp.gs.washington.edu/SeattleSeqAnnotation137/ |

Functional predictors
CADD | Machine learning algorithm to score all 8.6 billion possible substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants | http://cadd.gs.washington.edu/info | [49]
FATHMM | Uses Hidden Markov Models to predict the functional consequences of SNVs in coding and noncoding variants through a web server | http://fathmm.biocompute.org.uk/ | [46]
LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if it is predicted to be benign, deleterious, or unknown | http://genome.cshlp.org/content/19/9/1553.long | [45]
PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics | http://genetics.bwh.harvard.edu/pph2/ | [44]
SIFT | By using PSI-BLAST, a prediction can be made on the effect of a nonsynonymous mutation within a protein | http://sift.jcvi.org/ | [43]
VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein | http://karchinlab.org/apps/appVest.html | [48]
MetaSVM & MetaLR | Integration of a Support Vector Machine and Logistic Regression to integrate nine deleterious prediction scores of missense mutations | https://sites.google.com/site/jpopgen/dbNSFP | [47]

Significant somatic mutations
SomaticSniper | Using two BAM files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper/ | [50]
MuTect | Uses statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35]
VarSim | By leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim/ | [51]
SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52]

Copy number alteration
Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec/ | [59]
CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq/ | [53]
SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54]
VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, which are then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61]
ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64]
CNVseeqer | Exon coverage between matched sequences is calculated using log2 ratios followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60]
EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool/ | [57]
ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58]
ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies; implemented in R | http://adtex.sourceforge.net/ | [55]
CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv/ | [56]

Driver prediction tools
CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65]
Dendrix | De novo drivers are discovered from cancer-only mutational data, including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix/ | [66]
MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67]

Pathway analysis tools and resources
KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg/ | [68]
DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more | https://david.ncifcrf.gov/ | [69]
STRING | Network visualization of protein-protein interactions of over 2,031 organisms | http://string-db.org/ | [70]
BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex/ | [71]
DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72]
SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea/ | [73]

Tools and resources for linking variants to therapeutics
cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies, including providing patient and clinical data for samples | http://www.cbioportal.org/ | [78]
My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org/ | http://www.mycancergenome.org
ClinVar | Database of relationships between phenotypes and human variations, showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar/ | [74]
DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77]
PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org/ | [75]
DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca/ | [76]

WES data analysis pipelines
Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file and that has good capability for novice and expert users | http://fastq2vcf.sourceforge.net/ | [80]
SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest/ | [79]
IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81]
Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud |
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL-detecting software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs with 6,491, followed by Pindel with 7,239 and indelMINER with 7,365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER had the lowest number of false negatives, with 398 compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
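The counts quoted above are enough to recompute the false-positive percentages. The short Python sketch below does so under one assumption that is not stated in the text: that the false-positive percentage is the share of a tool's calls that are not true INDELs (false positives divided by calls). The function name is illustrative only.

```python
# Counts quoted in the text for the simulated dataset of 7500 INDELs [32].
TOTAL_TRUE_INDELS = 7500

# tool -> (INDELs called, false negatives)
reported = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}


def false_positive_percentage(called, false_negatives, total_true=TOTAL_TRUE_INDELS):
    """Percentage of calls that are not true INDELs (assumed definition)."""
    true_positives = total_true - false_negatives
    false_positives = called - true_positives
    return 100.0 * false_positives / called


for tool, (called, missed) in reported.items():
    pct = false_positive_percentage(called, missed)
    print(f"{tool}: {pct:.2f}% of calls are false positives")
```

Under this definition the three counts give the percentages quoted for SAMtools, Pindel, and indelMINER to two decimal places, which suggests that this is the definition the comparison used.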
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
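The idea of finding the longest prefix or suffix of a soft-clipped segment that matches the target sequence can be illustrated with a naive exact-match search. This is only a sketch of the concept, not the Sprites implementation: the function names, the exact-match criterion, and the toy sequences are all assumptions made for illustration.

```python
def longest_matching_prefix(clipped, target, min_len=1):
    """Return (length, position) of the longest prefix of `clipped` found in `target`.

    Returns (0, -1) if no prefix of at least `min_len` bases occurs in `target`.
    """
    for length in range(len(clipped), max(min_len, 1) - 1, -1):
        pos = target.find(clipped[:length])
        if pos != -1:
            return length, pos
    return 0, -1


def longest_matching_suffix(clipped, target, min_len=1):
    """Return (length, position) of the longest suffix of `clipped` found in `target`."""
    for length in range(len(clipped), max(min_len, 1) - 1, -1):
        pos = target.find(clipped[-length:])
        if pos != -1:
            return length, pos
    return 0, -1


# Toy sequences, invented for illustration.
target = "ACGTTGCAAGGCT"
clipped = "GCAAGTT"

print(longest_matching_prefix(clipped, target))  # longest prefix is "GCAAG"
print(longest_matching_suffix(clipped, target))  # longest suffix is "GTT"
```

A real caller would tolerate mismatches and use an index rather than repeated substring searches; the linear scan here is chosen only for readability.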
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim to rapidly annotate millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
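For readers unfamiliar with the file these tools operate on, the sketch below parses a single VCF data line into its eight fixed, tab-delimited columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and splits the INFO column, where annotation tools typically add their key=value entries. The example record and the `GENE` key are invented for illustration and are not the output of any tool named above.

```python
def parse_vcf_record(line):
    """Split one tab-delimited VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    annotations = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # "." marks an empty INFO column
        key, sep, value = entry.partition("=")
        annotations[key] = value if sep else True  # flag entries carry no value

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": annotations,
    }


# Hypothetical record; GENE is an invented annotation key.
example = "1\t12345\t.\tA\tT\t50\tPASS\tDP=54;GENE=GENE1;SOMATIC"
record = parse_vcf_record(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```

Production code would normally rely on an established VCF library rather than hand-rolled parsing; the point here is only to show where annotations live in a record.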
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases could be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
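A MAF-based filter of the kind described above can be sketched in a few lines of Python. The variant records, the field name `maf`, and the choice to keep variants that are absent from the database (treated here as potentially novel) are assumptions for illustration; the 1% threshold follows the rare-variant cutoff mentioned in the text.

```python
def keep_rare(variants, max_maf=0.01):
    """Keep variants whose minor allele frequency is below `max_maf`.

    Variants with no recorded MAF (None) are kept, on the assumption that
    absence from the population database marks a potentially novel variant.
    """
    kept = []
    for variant in variants:
        maf = variant.get("maf")
        if maf is None or maf < max_maf:
            kept.append(variant)
    return kept


# Hypothetical variants with made-up frequencies.
variants = [
    {"id": "varA", "maf": 0.25},   # common polymorphism, filtered out
    {"id": "varB", "maf": 0.004},  # rare, kept
    {"id": "varC", "maf": None},   # not in the database, kept
]

rare = keep_rare(variants)
print([v["id"] for v in rare])
```

A study interested in common SNVs would invert or drop this filter, which is why the threshold is exposed as a parameter.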
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting the deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple annotations into a single score, contrasting variants that have survived natural selection with simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
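Because the predictors rarely agree, a practical first step is often to tabulate how many tools call each variant deleterious before deciding how to weigh them. The sketch below shows one simple way to do that; the variant names and the per-tool calls are invented for illustration and are not real predictions from SIFT, PolyPhen-2, or LRT.

```python
# Hypothetical deleterious/benign calls (True = predicted deleterious).
calls = {
    "variant_1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "variant_2": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "variant_3": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
}


def summarize_agreement(calls_by_variant):
    """Count, per variant, how many predictors call it deleterious."""
    summary = {}
    for variant, by_tool in calls_by_variant.items():
        votes = sum(1 for is_deleterious in by_tool.values() if is_deleterious)
        summary[variant] = {
            "votes": votes,
            "tools": len(by_tool),
            "unanimous": votes == len(by_tool),
        }
    return summary


summary = summarize_agreement(calls)
for variant, stats in summary.items():
    print(variant, f"{stats['votes']}/{stats['tools']}", "unanimous" if stats["unanimous"] else "")
```

A plain vote count treats all predictors as equally reliable; ensemble methods such as MetaSVM and MetaLR instead learn how to weight the individual scores.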
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
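To make the notion of a Phred-scaled score concrete, the sketch below applies the standard Phred convention Q = -10 log10(p) to the probability that the tumor and normal genotypes are the same. This is an illustration under assumptions, not SomaticSniper's code: the function names, the cap of 255, and the choice to scale 1 minus the probability of a difference are ours; the exact definition used by the tool is given in [50].

```python
import math


def phred_scale(p_error, cap=255):
    """Convert an error probability to a Phred-scaled score, Q = -10*log10(p)."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be in [0, 1]")
    if p_error == 0.0:
        return cap  # avoid log10(0); the cap value is an assumption
    return min(cap, -10.0 * math.log10(p_error))


def somatic_score(p_different):
    """Phred-scale the probability that tumor and normal genotypes are the same.

    p_different: probability that the two genotypes differ (hypothetical input).
    """
    return phred_scale(1.0 - p_different)


# A site where the genotypes differ with probability 0.999 scores about 30.
score = somatic_score(0.999)
```

Under this convention a higher score means stronger evidence that the variant is specific to the tumor, which is why such scores are commonly thresholded when filtering calls.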
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, the somatic mutations are not only identified but also narrowed to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of 3 steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall rates when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
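The precision and recall figures quoted above compare one call set against a reference call set. A minimal sketch of that comparison follows; it is not taken from SomVarIUS or VarSim, and the function name, the (chromosome, position, ref, alt) key, and the example variants are assumptions chosen for illustration.

```python
def precision_recall(called, truth):
    """Compare a call set with a reference set of somatic variants.

    Both arguments are iterables of hashable variant keys, here assumed to be
    (chrom, pos, ref, alt) tuples. Returns (precision, recall).
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up variants: 4 unpaired calls, 5 paired-tissue reference calls, 3 shared.
unpaired_calls = [
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 500, "C", "T"),
    ("chr3", 900, "T", "G"),
]
paired_reference = [
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 500, "C", "T"),
    ("chr4", 42, "G", "A"),
    ("chr5", 77, "A", "C"),
]
precision, recall = precision_recall(unpaired_calls, paired_reference)
```

Matching on the full (chromosome, position, ref, alt) key is the strictest choice; a real comparison would also need to agree on variant normalization, which this sketch ignores.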
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the $\log_2$ of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
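The depth-ratio idea and the large-scale versus focal distinction can be sketched in a few lines. This is an illustration only and not VarScan2's implementation: the function names, the use of total input per sample as the normalizing factor, and the handling of a segment that is exactly 25% of the arm are our assumptions.

```python
import math


def log2_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """Log2 ratio of tumor to normal depth for one region.

    Each depth is first divided by that sample's total input (assumed here to
    be a total read or base count) so that deeper samples are not favored.
    """
    if min(tumor_depth, normal_depth, tumor_total, normal_total) <= 0:
        raise ValueError("depths and totals must be positive")
    tumor_norm = tumor_depth / tumor_total
    normal_norm = normal_depth / normal_total
    return math.log2(tumor_norm / normal_norm)


def classify_event(segment_length, arm_length, threshold=0.25):
    """Label a segment large-scale if it spans more than 25% of the arm.

    A segment at exactly the threshold is labeled focal here; the text does
    not specify that boundary case.
    """
    fraction = segment_length / arm_length
    return "large-scale" if fraction > threshold else "focal"


# Equal input per sample, twice the depth in tumor: log2 ratio of +1 (a gain).
gain = log2_ratio(200, 100, 1_000_000, 1_000_000)
# Same raw depth but tumor sequenced twice as deeply overall: ratio of -1.
loss = log2_ratio(100, 100, 2_000_000, 1_000_000)
label = classify_event(segment_length=40_000_000, arm_length=120_000_000)
```

Segmentation (for example, CBS) would operate on a series of such per-region ratios; that step is deliberately left out of this sketch.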
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between the A- and B-alleles in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES data for the detection of recurrent AI events without matched normal samples.
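The quantity at the center of this approach, the absolute deviation of the B-allele frequency from 0.5 at a heterozygous site, is simple to compute from allele read counts. The sketch below is illustrative and is not ExomeAI's code; the function name and the assumption that the B-allele is the alternate allele are ours.

```python
def baf_deviation(ref_count, alt_count):
    """Absolute deviation of the B-allele frequency from the expected 0.5.

    The B-allele is assumed to be the alternate allele, so the frequency is
    alt_count / (ref_count + alt_count).
    """
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this site")
    b_allele_frequency = alt_count / total
    return abs(b_allele_frequency - 0.5)


# A balanced heterozygous site versus one with a 3:1 imbalance (made-up counts).
balanced = baf_deviation(50, 50)
imbalanced = baf_deviation(25, 75)
```

Taking the absolute value makes the measure symmetric, so imbalance toward either allele contributes equally when the per-site values are segmented along a chromosomal arm.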
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: “driver” and “passenger” mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are “by-stander” alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the “driver” mutations from the “passenger” mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database (“positive examples”) and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types (“negative examples”). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
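The effect of gene size on chance mutation can be illustrated with a deliberately naive model in which every base mutates independently at one uniform background rate. This is not the MutSigCV model, which additionally accounts for patient-specific and gene-specific covariates; the function name, the rate, the cohort size, and the round gene lengths below are all hypothetical values chosen for illustration.

```python
def prob_mutated_by_chance(coding_length_bp, rate_per_bp, n_patients=1):
    """P(at least one mutation in the gene) under a uniform background model.

    Assumes every base in every patient mutates independently with the same
    probability rate_per_bp, so the gene escapes mutation with probability
    (1 - rate)^(length * patients).
    """
    n_sites = coding_length_bp * n_patients
    return 1.0 - (1.0 - rate_per_bp) ** n_sites


# Hypothetical inputs: 1 mutation per megabase, a cohort of 100 patients.
rate = 1e-6
p_long_gene = prob_mutated_by_chance(100_000, rate, n_patients=100)
p_short_gene = prob_mutated_by_chance(1_000, rate, n_patients=100)
```

Under these made-up numbers the long gene is almost certain to carry at least one mutation somewhere in the cohort, while the short gene is mutated in under a tenth of such cohorts, which is the reason a recurrence count alone cannot identify drivers.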
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes that most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely found together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
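The two properties can be combined into a single score for a candidate gene set. The sketch below scores a set as coverage minus overlap, which is our understanding of the kind of weight described in [66] and should be treated as an assumption; it does not implement the greedy or MCMC search, and the function names and the toy mutation table are hypothetical.

```python
def coverage_and_overlap(gene_set, mutations):
    """Return (coverage, overlap) for a gene set.

    mutations: dict mapping gene -> set of patient IDs mutated in that gene.
    coverage = number of patients with a mutation in at least one gene.
    overlap  = number of "extra" hits, i.e. total hits minus coverage;
               it is zero when the mutations are mutually exclusive.
    """
    covered_patients = set()
    total_hits = 0
    for gene in gene_set:
        patients = mutations.get(gene, set())
        covered_patients |= patients
        total_hits += len(patients)
    coverage = len(covered_patients)
    return coverage, total_hits - coverage


def exclusivity_weight(gene_set, mutations):
    """Score a gene set as coverage minus overlap (higher is better)."""
    coverage, overlap = coverage_and_overlap(gene_set, mutations)
    return coverage - overlap


# Toy mutation table with made-up gene names and patient IDs.
toy_mutations = {
    "GENE_A": {1, 2, 3},
    "GENE_B": {4, 5},
    "GENE_C": {1, 2},
}
exclusive_weight = exclusivity_weight(["GENE_A", "GENE_B"], toy_mutations)
overlapping_weight = exclusivity_weight(["GENE_A", "GENE_C"], toy_mutations)
```

In the toy table the pair GENE_A and GENE_B covers five patients with no overlap, whereas GENE_A and GENE_C cover only three patients and share two of them, so the first pair scores higher.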
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
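Functional enrichment analysis of a gene list is commonly framed as an over-representation test. The sketch below computes a plain upper-tail hypergeometric p-value; it is a generic illustration and not the statistic of any specific tool above, each of which may apply its own variant and multiple-testing correction. The function and parameter names are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_query, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_query:      genes in the user-defined list
    n_overlap:    query genes that are annotated to the pathway
    """
    max_overlap = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 in the pathway, 5 queried, all 5 overlap.
p_value = hypergeom_enrichment_p(20, 5, 5, 5)
```

With these toy numbers only one of the 15504 possible query lists contains all five pathway genes, so the p-value is 1/15504; real analyses test many pathways and therefore need correction for multiple comparisons.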
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
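At its simplest, linking variants to treatments is a lookup of mutated genes against a table of drug targets. The sketch below shows only that lookup step under stated assumptions: the function name and the two-entry table are illustrative stand-ins, not an export or API of any database named above, and the table is far too small to be used for any real interpretation.

```python
def link_variants_to_drugs(mutated_genes, drug_targets):
    """Return {gene: sorted list of drugs} for mutated genes with a listed drug.

    mutated_genes: iterable of gene symbols carrying candidate variants.
    drug_targets:  dict mapping gene symbol -> iterable of drug names
                   (a hypothetical, hand-made table in this sketch).
    """
    hits = {}
    for gene in mutated_genes:
        drugs = drug_targets.get(gene)
        if drugs:
            hits[gene] = sorted(drugs)
    return hits


# Tiny illustrative table; a real analysis would query a curated resource.
toy_drug_targets = {
    "BRAF": ["dabrafenib"],
    "MAP2K1": ["trametinib"],
}
actionable = link_variants_to_drugs(["BRAF", "TTN", "CDKN2A"], toy_drug_targets)
```

A gene-level lookup like this ignores which variant is present and its clinical significance; the curated databases exist precisely to supply that missing context.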
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options as compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
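The comparison above can be captured as a small lookup structure, which makes questions such as "which pipelines support a given tool" easy to answer. The sketch below encodes only what is stated in this section; the dictionary layout, the key names, and the helper `pipelines_with` are our own illustrative choices and do not correspond to any configuration format used by the pipelines themselves.

```python
# Components of each pipeline as described in the text of this section.
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "post_vcf": set(),
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {
            "GATK UnifiedGenotyper",
            "GATK HaplotypeCaller",
            "SAMtools",
            "SNVer",
        },
        "post_vcf": set(),
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "post_vcf": {
            "copy number analysis",
            "tumor heterogeneity",
            "actionable therapeutics",
        },
    },
}


def pipelines_with(category, value):
    """Return the sorted names of pipelines listing `value` under `category`."""
    return sorted(
        name for name, spec in PIPELINES.items() if value in spec[category]
    )


bwa_pipelines = pipelines_with("aligners", "BWA")
cna_pipelines = pipelines_with("post_vcf", "copy number analysis")
```

Keeping the comparison as data rather than prose makes it straightforward to extend when a pipeline adds a tool, although the entries here should be checked against each pipeline's current documentation before being relied upon.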
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These include computational methods developed to generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “DbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “Somaticsniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1: Continued.

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| YOABS | Uses a O(n) algorithm that uses both hash and tri-based methods that are effective in aligning sequences over 200 bp with 3 times less memory and ten times faster than SSAHA | Available by request for noncommercial use | [19] |
| HTSeq | Python based package with many functions to facilitate several aspects of sequencing studies | http://www-huber.embl.de/HTSeq/doc/overview.html | |
| **Auxiliary tools** | | | |
| FastUniq | Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data | https://sourceforge.net/projects/fastuniq | [23] |
| Picard | Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF | http://picard.sourceforge.net | |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| **SNV and SV calling** | | | |
| GATK | Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms | https://www.broadinstitute.org/gatk | [4–6] |
| SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files | http://www.htslib.org | [7] |
| VCMM | Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data | http://emu.src.riken.jp/VCMM | [25] |
| FreeBayes | Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods | https://github.com/ekg/freebayes | [27] |
| indelMINER | Split-read algorithm to identify breakpoint in INDELs from paired-end sequencing data | https://github.com/aakrosh/indelMINER | [32] |
| Pindel | Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution | http://gmt.genome.wustl.edu/packages/pindel | [30] |
| Platypus | Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity | http://www.well.ox.ac.uk/platypus | [26] |
| Splitread | Detection of INDELs less than 50 bp long from WES or WGS data using a split-read algorithm | http://splitread.sourceforge.net | [31] |
| Sprites | Detection of INDELs is done using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage | https://github.com/zhangzhen/sprites | [33] |
| **VCF annotation** | | | |
| ANNOVAR | Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases | http://annovar.openbioinformatics.org | [34] |
| MuTect | Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing | http://www.broadinstitute.org/cancer/cga/mutect | [35] |
| SnpEff | Uses 38000 genomes to predict and annotate the effects of variants on genes | http://snpeff.sourceforge.net | [36] |
| SnpSift | Tools to manipulate VCF files including filtering, annotation, case controls, transition and transversion rates, and more | http://snpeff.sourceforge.net/SnpSift.html | [37] |
| VAT | Annotation of variants by functionality in a cloud computing environment | http://vat.gersteinlab.org | [38] |
Table 1: Continued.

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| **Database filtration** | | | |
| 1000 Genomes Project | Genotype information from a population of 1000 healthy individuals | http://www.1000genomes.org | [41] |
| dbSNP | Database of genomic variants from 53 organisms | https://www.ncbi.nlm.nih.gov/projects/SNP | [39] |
| LOVD | Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data | http://www.lovd.nl/3.0/home | [40] |
| COSMIC | Database containing somatic mutations from human cancers separated into expert curated data and genome-wide screen published in scientific literature | http://cancer.sanger.ac.uk/cosmic | [42] |
| NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations | http://evs.gs.washington.edu/EVS | |
| Exome Aggregation Consortium (ExAC) | Database of 60706 unrelated individuals from disease and population exome sequencing studies | http://exac.broadinstitute.org | [3] |
| SeattleSeq Annotation | Part of the NHLBI sequencing project, this database contains novel and known SNVs and INDELs including accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions | http://snp.gs.washington.edu/SeattleSeqAnnotation137 | |
| **Functional predictors** | | | |
| CADD | Machine learning algorithm to score all possible 86 million substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants | http://cadd.gs.washington.edu/info | [49] |
| FATHMM | Uses Hidden Markov Models to predict the functional consequences of SNVs in coding and noncoding variants through a web server | http://fathmm.biocompute.org.uk | [46] |
| LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if they are predicted to be benign, deleterious, or unknown | http://genome.cshlp.org/content/19/9/1553.long | [45] |
| PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics | http://genetics.bwh.harvard.edu/pph2 | [44] |
| SIFT | By using PSI-BLAST, a prediction can be made on the effect of a nonsynonymous mutation within a protein | http://sift.jcvi.org | [43] |
| VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein | http://karchinlab.org/apps/appVest.html | [48] |
| MetaSVM & MetaLR | Integration of a Support Vector Machine and Logistic Regression to integrate nine deleterious prediction scores of missense mutations | https://sites.google.com/site/jpopgen/dbNSFP | [47] |
Table 1: Continued.

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| **Significant somatic mutations** | | | |
| SomaticSniper | Using two bam files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper | [50] |
| MuTect | Using statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35] |
| VarSim | By leveraging on previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim | [51] |
| SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52] |
| **Copy number alteration** | | | |
| Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec | [59] |
| CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq | [53] |
| SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54] |
| VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios and then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61] |
| ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64] |
| CNVseeqer | Exon coverage between matched sequences was calculated using log2 ratios followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60] |
| EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool | [57] |
| ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58] |
| ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies and implemented in R | http://adtex.sourceforge.net | [55] |
| CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv | [56] |
Table 1: Continued.

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| **Driver prediction tools** | | | |
| CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65] |
| Dendrix | De novo drivers are discovered from cancer only mutational data including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix | [66] |
| MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67] |
| **Pathway analysis tools and resources** | | | |
| KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg | [68] |
| DAVID | Allows for users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more | https://david.ncifcrf.gov | [69] |
| STRING | Network visualization of protein-protein interactions of over 2031 organisms | http://string-db.org | [70] |
| BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex | [71] |
| DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72] |
| SNPsea | Uses a linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea | [73] |
| **Tools and resources for linking variants to therapeutics** | | | |
| cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies including providing patient and clinical data for samples | http://www.cbioportal.org | [78] |
| My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org | http://www.mycancergenome.org |
| ClinVar | Database of relationship between phenotypes and human variations showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar | [74] |
| DSigDB | Database of drug signatures that includes 19531 genes and 17389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77] |
| PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org | [75] |
| DrugBank | Contains detailed drug information with comprehensive drug target information for 8206 drugs | http://www.drugbank.ca | [76] |
Table 1: Continued.

| Computational tools | Description | Website | References |
| --- | --- | --- | --- |
| **WES data analysis pipelines** | | | |
| Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file that has good capability for novice and expert users | http://fastq2vcf.sourceforge.net | [80] |
| SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest | [79] |
| IMPACT | WES data analysis pipeline that starts with raw sequencing reads and analyzes SNVs and CNAs and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81] |
| Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud | |
structural variants with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits it being used in conjunction with other INDEL detecting software in WES analysis.
Recently developed indelMINER is a compilation of tools that takes the strengths of split-read and de novo assembly to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7500 INDELs [32]. SAMtools found the fewest INDELs with 6491, followed by Pindel with 7239 and indelMINER with 7365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than SAMtools (2.65%) but lower than Pindel (4.53%). Conversely, indelMINER did have the lowest number of false-negatives with 398, compared to 589 and 1181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
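The benchmark figures quoted above can be cross-checked and condensed into one F-score per tool. The following is only an illustrative sketch: it assumes that the reported false-positive percentage is the share of a tool's calls that are false, a reading that the text does not state explicitly, and the function and variable names are our own.

```python
# Figures quoted in the text for the simulated benchmark of 7500 INDELs [32].
TRUE_INDELS = 7500
REPORTED = {
    # tool: (INDELs called, false-positive %, false negatives)
    "SAMtools": (6491, 2.65, 1181),
    "Pindel": (7239, 4.53, 589),
    "indelMINER": (7365, 3.57, 398),
}


def summarize(called, fp_percent, false_negatives, truth=TRUE_INDELS):
    """Derive precision, recall, and F-score from the reported figures.

    Assumption: fp_percent is the percentage of calls that are false.
    """
    true_positives = truth - false_negatives
    precision = 1.0 - fp_percent / 100.0
    recall = true_positives / truth
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "true_positives": true_positives,
        # Cross-check: true positives implied by the call count alone.
        "implied_true_positives": called * precision,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


SUMMARY = {tool: summarize(*figures) for tool, figures in REPORTED.items()}

for tool, stats in SUMMARY.items():
    print(f"{tool}: recall={stats['recall']:.3f} F={stats['f_score']:.3f}")
```

Under this reading, the true positives implied by each tool's call count agree with 7500 minus its false negatives to within one INDEL, and indelMINER obtains the highest F-score of the three.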
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
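The idea of finding the longest prefix or suffix of a soft-clipped segment that matches a target sequence can be sketched as below. This is a deliberately simplified illustration using exact string matching. It is not the Sprites implementation, and the function names and toy sequences are hypothetical.

```python
def longest_prefix_match(clipped, target):
    """Return (length, position) of the longest prefix of `clipped`
    that occurs exactly in `target`, or (0, -1) if none does."""
    for length in range(len(clipped), 0, -1):
        position = target.find(clipped[:length])
        if position != -1:
            return length, position
    return 0, -1


def longest_suffix_match(clipped, target):
    """Return (length, position) of the longest suffix of `clipped`
    that occurs exactly in `target`, or (0, -1) if none does."""
    for length in range(len(clipped), 0, -1):
        position = target.find(clipped[-length:])
        if position != -1:
            return length, position
    return 0, -1


# Toy example: the clipped segment shares only its first 5 bases with the target.
target_sequence = "ACGTTGCAGGCTA"
print(longest_prefix_match("GCAGGTTT", target_sequence))
print(longest_suffix_match("TTTGCTA", target_sequence))
```

A practical tool would tolerate mismatches and use an index rather than repeated scans; exact matching is used here only to keep the sketch short.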
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim to rapidly annotate millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene, region, or filter-based annotation to access over 20 public databases for variants annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38].
VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
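Transcript-level annotation of the kind described for VAT can be illustrated with a small sketch that reports whether a variant position falls in the exons of all, a subset, or none of a gene's isoforms. The data layout, names, and coordinates below are hypothetical and are not taken from VAT itself.

```python
def isoforms_hit(position, transcripts):
    """Classify a variant position against a gene's transcript isoforms.

    transcripts: dict mapping isoform name -> list of (start, end) exon
    intervals, 1-based and inclusive (an assumed, illustrative layout).
    Returns (status, hit_isoforms), status being "all", "subset", or "none".
    """
    hit = sorted(
        name
        for name, exons in transcripts.items()
        if any(start <= position <= end for start, end in exons)
    )
    if not hit:
        status = "none"
    elif len(hit) == len(transcripts):
        status = "all"
    else:
        status = "subset"
    return status, hit


# Hypothetical gene with two isoforms; tx2 lacks the second exon.
toy_gene = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(100, 200)],
}
for pos in (150, 350, 250):
    print(pos, isoforms_hit(pos, toy_gene))
```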
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases could be used as filtering criteria for detecting novel variants from common polymorphisms. These databases score a variant by its minor allelic frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
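A MAF-based filter of the kind described above can be sketched in a few lines. The record layout (a dictionary with a "maf" field, None when the variant is absent from the reference database) and the variant identifiers are assumptions made for illustration, and the 1% default threshold simply mirrors the cancer example in the text.

```python
def filter_by_maf(variants, max_maf=0.01, keep_unknown=True):
    """Keep variants whose population MAF is below `max_maf`.

    variants: list of dicts with a "maf" key (float, or None when the
    variant is not present in the population database).
    keep_unknown: whether variants without a MAF (potentially novel)
    are retained.
    """
    kept = []
    for variant in variants:
        maf = variant.get("maf")
        if maf is None:
            if keep_unknown:
                kept.append(variant)
        elif maf < max_maf:
            kept.append(variant)
    return kept


# Hypothetical annotated variants.
annotated = [
    {"id": "var_common", "maf": 0.23},
    {"id": "var_rare", "maf": 0.002},
    {"id": "var_novel", "maf": None},
]
print([v["id"] for v in filter_by_maf(annotated)])
```

Whether unknown variants are kept, and which threshold is used, depends on the purpose of the study, as noted above.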
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project that uses statistics from the sequencing of more than 1000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence based methods and three structure based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutation [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting the deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
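How often a set of predictors agrees can be quantified with a simple tally. The sketch below assumes that each predictor's output has already been reduced to a deleterious/not-deleterious call per variant; the layout, names, and toy calls are hypothetical and do not reproduce any published comparison.

```python
def agreement_summary(calls):
    """Summarize agreement between functional predictors.

    calls: dict mapping variant id -> dict of predictor name -> bool
    (True when that predictor calls the variant deleterious).
    Returns (flagged_by_any, flagged_by_all, unanimous_fraction), where the
    fraction is taken over variants flagged by at least one predictor.
    """
    flagged_by_any = [v for v, c in calls.items() if any(c.values())]
    flagged_by_all = [v for v in flagged_by_any if all(calls[v].values())]
    fraction = len(flagged_by_all) / len(flagged_by_any) if flagged_by_any else 0.0
    return flagged_by_any, flagged_by_all, fraction


# Toy calls for four hypothetical variants.
toy_calls = {
    "v1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "v2": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "v3": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
    "v4": {"SIFT": False, "PolyPhen-2": False, "LRT": False},
}
any_hit, all_hit, frac = agreement_summary(toy_calls)
print(any_hit, all_hit, round(frac, 2))
```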
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample, hence predicted as somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
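Phred scaling converts a probability p into the score Q = -10 log10(p), so that smaller probabilities give larger scores. The sketch below assumes that the probability being scaled is the probability that the tumor and normal genotypes are the same (so that high scores indicate a likely somatic difference). The cap of 255 and the function name are our own illustrative choices rather than documented SomaticSniper behavior.

```python
import math


def phred_scale(p_same_genotype, cap=255):
    """Convert a probability into a Phred-scaled integer score.

    p_same_genotype: assumed probability that tumor and normal genotypes
    are identical (i.e. that the site is not somatic).
    cap: upper bound used when the probability is 0 (illustrative choice).
    """
    if not 0.0 <= p_same_genotype <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_same_genotype == 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_same_genotype)))


for p in (0.1, 0.001, 1.0):
    print(p, phred_scale(p))
```

On this scale a score of 30 corresponds to a probability of 0.001, and a score of 10 to a probability of 0.1.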
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, the number of somatic mutations is not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of 3 steps for somatic variant detection. SomVarIUS first prioritizes potential variant sites, then estimates the probability of a sequencing error, followed by the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall rates when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs could be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
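The depth normalization, log2 ratio, and size-based classification described above can be sketched as follows. This is a minimal illustration, not VarScan2's code: the normalization by total read count is one plausible reading of "amount of input data", the handling of an event at exactly 25% of the arm is our own choice, and all names are hypothetical.

```python
import math


def log2_ratio(tumor_depth, normal_depth, tumor_total_reads, normal_total_reads):
    """log2 of tumor/normal depth for one region, after scaling the tumor
    depth to the normal sample's amount of input data (assumed to be the
    total read count of each sample)."""
    scale = normal_total_reads / tumor_total_reads
    return math.log2((tumor_depth * scale) / normal_depth)


def classify_event(segment_length, arm_length, threshold=0.25):
    """Label a segment as large-scale (> 25% of the chromosome arm) or focal."""
    return "large-scale" if segment_length / arm_length > threshold else "focal"


# Equal sequencing effort, tumor depth doubled: one extra log2 unit.
print(log2_ratio(200, 100, 1_000_000, 1_000_000))
# Tumor sequenced twice as deeply overall, same raw depth: relative loss.
print(log2_ratio(100, 100, 2_000_000, 1_000_000))
print(classify_event(30_000_000, 100_000_000), classify_event(2_000_000, 100_000_000))
```

Segmentation (CBS) would then be applied to the per-region ratios; it is omitted here because it requires considerably more code.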
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. Absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
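The quantity at the heart of this approach, the absolute deviation of the B-allele frequency (BAF) from 0.5 at a heterozygous site, is simple to compute from allele read counts. The sketch below is illustrative only; the function name and the example counts are hypothetical.

```python
def baf_deviation(a_allele_reads, b_allele_reads):
    """Absolute deviation of the B-allele frequency from the 0.5 expected
    at a balanced heterozygous site."""
    total = a_allele_reads + b_allele_reads
    if total == 0:
        raise ValueError("site has no read coverage")
    return abs(b_allele_reads / total - 0.5)


# A balanced site and an imbalanced site (hypothetical read counts).
print(baf_deviation(50, 50))
print(baf_deviation(20, 80))
```

Values near 0 are consistent with the expected 1:1 ratio, whereas values approaching 0.5 indicate that one allele dominates, as would be expected under LOH.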
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNAs detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As the cancer exomes tend to have high mutational burdens, identifying the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish the difference between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver detection prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the size of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
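The effect of gene length on chance mutation can be made concrete with a back-of-the-envelope calculation. Assuming, purely for illustration, that every base mutates independently with the same background rate r, the probability that a gene of L bases carries at least one mutation in a sample is 1 - (1 - r)^L. The rate and gene lengths below are hypothetical, and this uniform-rate model is far simpler than the covariate-based background model the text attributes to MutSigCV.

```python
def prob_mutated_by_chance(gene_length_bp, per_base_rate):
    """Probability that a gene carries at least one background mutation,
    assuming independent, uniform per-base mutation (a simplification)."""
    return 1.0 - (1.0 - per_base_rate) ** gene_length_bp


RATE = 1e-6  # hypothetical background mutations per base per sample
long_gene = prob_mutated_by_chance(100_000, RATE)  # hypothetical very long gene
short_gene = prob_mutated_by_chance(1_500, RATE)   # hypothetical short gene

print(f"long gene:  {long_gene:.4f}")
print(f"short gene: {short_gene:.4f}")
```

With these illustrative numbers the long gene is dozens of times more likely than the short gene to be mutated in any given sample, which is why raw mutation counts alone can make very large genes appear recurrently mutated.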
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to measure sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
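The two properties can be turned into a single score for a candidate gene set. The sketch below rewards coverage (patients with at least one mutation in the set) and penalizes overlap (extra mutations falling in already-covered patients). This particular weight, coverage minus overlap, is an assumption chosen for illustration, since the text does not specify the scoring used by Dendrix, and the exhaustive search stands in for the greedy and MCMC strategies only on toy-sized inputs. All names and data are hypothetical.

```python
from itertools import combinations


def score_gene_set(gene_set, mutated_patients):
    """Score a gene set by coverage and exclusivity.

    mutated_patients: dict mapping gene -> set of patients mutated in it.
    """
    patient_sets = [mutated_patients[gene] for gene in gene_set]
    coverage = len(set().union(*patient_sets))
    overlap = sum(len(p) for p in patient_sets) - coverage
    return {"coverage": coverage, "overlap": overlap, "weight": coverage - overlap}


def best_gene_set(mutated_patients, size):
    """Exhaustively pick the highest-weight gene set of a given size
    (feasible only for small inputs)."""
    return max(
        combinations(sorted(mutated_patients), size),
        key=lambda genes: score_gene_set(genes, mutated_patients)["weight"],
    )


# Toy mutation data: G1 and G2 are mutually exclusive and cover all patients.
toy_mutations = {
    "G1": {"p1", "p2"},
    "G2": {"p3", "p4"},
    "G3": {"p1", "p2", "p3"},
}
print(best_gene_set(toy_mutations, 2))
print(score_gene_set(("G1", "G3"), toy_mutations))
```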
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user defined gene lists. STRING is the largest protein-protein interactions database for querying and searching for interactions between user defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and give them the flexibility to explore the interactions between user defined gene lists by expanding the networks.
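Functional enrichment analysis is commonly framed as an over-representation test: given a gene list, is a pathway hit more often than expected by chance? A minimal sketch using the hypergeometric tail probability is shown below. It illustrates the general idea only and is not the exact statistic implemented by DAVID, STRING, or BEReX; the function name and example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, list_size, overlap):
    """One-sided hypergeometric p-value: probability of seeing at least
    `overlap` pathway genes in a random list of `list_size` genes drawn
    from a background of `total_genes`, of which `pathway_genes` belong
    to the pathway."""
    upper = min(pathway_genes, list_size)
    favourable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, list_size)


# Hypothetical example: 20000 background genes, a 100-gene pathway,
# and a 50-gene mutated list containing 5 pathway members.
print(enrichment_p_value(20_000, 100, 50, 5))
```

In practice the resulting p-values are corrected for the number of pathways tested.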
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
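The basic lookup step behind such a query can be sketched as follows. This is a simplified illustration: the gene and drug names are placeholders, the drug-target table is a hypothetical in-memory stand-in for an export from a resource such as DrugBank or DSigDB, and none of these databases' actual interfaces are used.

```python
# Hypothetical drug-target table: gene symbol -> drugs reported to target it.
EXAMPLE_DRUG_TARGETS = {
    "GENE1": ["drug_b", "drug_a"],
    "GENE2": ["drug_c"],
}

# Hypothetical variant list, e.g. parsed from an annotated VCF.
EXAMPLE_VARIANTS = [
    {"gene": "GENE1", "change": "variant_1"},
    {"gene": "GENE3", "change": "variant_2"},
    {"gene": "GENE1", "change": "variant_3"},
]


def link_variants_to_drugs(variants, drug_targets):
    """Return {gene: sorted drug names} for mutated genes with a known drug."""
    hits = {}
    for variant in variants:
        gene = variant["gene"]
        drugs = drug_targets.get(gene)
        if drugs:
            hits.setdefault(gene, set()).update(drugs)
    return {gene: sorted(drugs) for gene, drugs in hits.items()}
```

Genes without an entry in the table (GENE3 here) are simply left out; an enrichment step over drug gene sets would follow in a real analysis.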
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate the computational tools and methods described in the previous sections into a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to its use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as the alignment tool and calling variants with GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequence alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR; Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
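The comparison above can be captured as a small lookup structure and a filter over it. This is only a sketch: the feature labels (for example "copy_number") and the function name are hypothetical. The table records only what this review states about each pipeline, so an empty Post-VCF set means "none listed here" and does not mean that the pipeline lacks such features.

```python
# Features as described in this section; labels are illustrative assumptions.
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {"GATK UnifiedGenotyper", "GATK HaplotypeCaller",
                    "SAMtools", "SNVer"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "post_vcf": {"copy_number", "tumor_heterogeneity", "drug_linking"},
        "tumor_only": True,
    },
}


def candidate_pipelines(tumor_sample, post_vcf_needs=(), aligner=None):
    """Return the pipelines (sorted by name) compatible with the stated needs."""
    needs = set(post_vcf_needs)
    matches = []
    for name, spec in PIPELINES.items():
        if spec["tumor_only"] and not tumor_sample:
            continue  # pipeline is designed for tumor samples only
        if not needs <= spec["post_vcf"]:
            continue  # a required Post-VCF analysis is not listed
        if aligner is not None and aligner not in spec["aligners"]:
            continue  # requested aligner is not supported
        matches.append(name)
    return sorted(matches)
```

For instance, a non-tumor dataset leaves Fastq2vcf and SeqMule as candidates, while a request for copy number analysis on a tumor sample leaves only IMPACT.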
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods include tools that generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31-46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333-351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285-291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297-1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491-498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1-11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078-2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589-595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357-359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185-1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873-881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851-1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061-1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713-714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966-1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725-1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936-939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070-1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473-483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135-1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36-46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256-278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912-918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865-2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176-178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788-1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213-219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80-92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267-2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "dbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308-311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63-68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061-1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805-D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073-1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248-249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553-1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57-65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125-2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310-315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "Somaticsniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311-317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469-1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808-813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99-103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307-1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648-2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268-269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568-576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038-2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185-192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429-431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, "Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM)," Cancer Biology & Therapy, vol. 10, no. 6, pp. 582-587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, "De novo discovery of mutated driver pathways in cancer," Genome Research, vol. 22, no. 2, pp. 375-385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., "Mutational heterogeneity in cancer and the search for new cancer-associated genes," Nature, vol. 499, no. 7457, pp. 214-218, 2013.
[68] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27-30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1-13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, no. 1, pp. D447-D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, "BEReX: biomedical entity-relationship eXplorer," Bioinformatics, vol. 30, no. 1, pp. 135-136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., "Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology," PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci," Bioinformatics, vol. 30, no. 17, pp. 2496-2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic Acids Research, vol. 42, no. 1, pp. D980-D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., "Pharmacogenomics knowledge for personalized medicine," Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414-417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., "DrugBank 4.0: shedding new light on drug metabolism," Nucleic Acids Research, vol. 42, no. 1, pp. D1091-D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., "DSigDB: drug signatures database for gene set analysis," Bioinformatics, vol. 31, no. 18, pp. 3069-3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., "The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data," Cancer Discovery, vol. 2, no. 5, pp. 401-404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, "SeqMule: automated pipeline for analysis of human exome/genome sequencing data," Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, "Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses," BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., "IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples," Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721-730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., "A research roadmap for next-generation sequencing informatics," Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| Database filtration | | | |
| 1000 Genomes Project | Genotype information from a population of 1000 healthy individuals | http://www.1000genomes.org | [41] |
| dbSNP | Database of genomic variants from 53 organisms | https://www.ncbi.nlm.nih.gov/projects/SNP | [39] |
| LOVD | Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data | http://www.lovd.nl/3.0/home | [40] |
| COSMIC | Database containing somatic mutations from human cancers, separated into expert curated data and genome-wide screen published in scientific literature | http://cancer.sanger.ac.uk/cosmic | [42] |
| NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations | http://evs.gs.washington.edu/EVS | |
| Exome Aggregation Consortium (ExAC) | Database of 60,706 unrelated individuals from disease and population exome sequencing studies | http://exac.broadinstitute.org | [3] |
| SeattleSeq Annotation | Part of the NHLBI sequencing project, this database contains novel and known SNVs and INDELs including accession number, function of the variant, and HapMap frequencies, clinical association, and PolyPhen predictions | http://snp.gs.washington.edu/SeattleSeqAnnotation137 | |
| Functional predictors | | | |
| CADD | Machine learning algorithm to score all possible 86 million substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants | http://cadd.gs.washington.edu/info | [49] |
| FATHMM | Uses Hidden Markov Models to predict the functional consequences of SNVs in coding and noncoding variants through a web server | http://fathmm.biocompute.org.uk | [46] |
| LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if they are predicted to be benign, deleterious, or unknown | http://genome.cshlp.org/content/19/9/1553.long | [45] |
| PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics | http://genetics.bwh.harvard.edu/pph2 | [44] |
| SIFT | By using PSI-BLAST, a prediction can be made on the effect of a nonsynonymous mutation within a protein | http://sift.jcvi.org | [43] |
| VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein | http://karchinlab.org/apps/appVest.html | [48] |
| MetaSVM & MetaLR | Integration of a Support Vector Machine and Logistic Regression to integrate nine deleterious prediction scores of missense mutations | https://sites.google.com/site/jpopgen/dbNSFP | [47] |
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| Significant somatic mutations | | | |
| SomaticSniper | Using two bam files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper | [50] |
| MuTect | Using statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35] |
| VarSim | By leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim | [51] |
| SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52] |
| Copy number alteration | | | |
| Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec | [59] |
| CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq | [53] |
| SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54] |
| VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61] |
| ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64] |
| CNVseeqer | Exon coverage between matched sequences was calculated using log2 ratios followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60] |
| EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool | [57] |
| ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58] |
| ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies, implemented in R | http://adtex.sourceforge.net | [55] |
| CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv | [56] |
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| Driver prediction tools | | | |
| CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65] |
| Dendrix | De novo drivers are discovered from cancer only mutational data including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix | [66] |
| MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67] |
| Pathway analysis tools and resources | | | |
| KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg | [68] |
| DAVID | Allows for users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more | https://david.ncifcrf.gov | [69] |
| STRING | Network visualization of protein-protein interactions of over 2031 organisms | http://string-db.org | [70] |
| BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex | [71] |
| DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72] |
| SNPsea | Uses a linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea | [73] |
| Tools and resources for linking variants to therapeutics | | | |
| cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies including providing patient and clinical data for samples | http://www.cbioportal.org | [78] |
| My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org | http://www.mycancergenome.org |
| ClinVar | Database of relationship between phenotypes and human variations showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar | [74] |
| DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77] |
| PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org | [75] |
| DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca | [76] |
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| WES data analysis pipelines | | | |
| Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file, with good capability for novice and expert users | http://fastq2vcf.sourceforge.net | [80] |
| SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest | [79] |
| IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81] |
| Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud | |
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL detection software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7500 INDELs [32]. SAMtools found the fewest INDELs with 6491, followed by Pindel with 7239 and indelMINER with 7365 INDELs identified. However, indelMINER’s false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER did have the lowest number of false-negatives with 398, compared to 589 and 1181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. It is therefore likely that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
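The false-positive percentages quoted for this benchmark can be recovered from the reported counts. The short Python sketch below assumes that true positives equal the 7500 simulated INDELs minus the false negatives, and that the false-positive percentage is taken relative to the number of calls made; these definitions are our assumption and are not stated explicitly in the text, although they reproduce the quoted figures.

```python
# Counts reported for the simulated dataset of 7500 INDELs [32].
TOTAL_INDELS = 7500
REPORTED = {
    # tool: (INDELs called, false negatives)
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}


def false_positive_pct(called, false_negatives, total=TOTAL_INDELS):
    """Percentage of calls that are false positives.

    Assumes true positives = total simulated INDELs - false negatives.
    """
    true_positives = total - false_negatives
    false_positives = called - true_positives
    return 100.0 * false_positives / called


for tool, (called, fn) in REPORTED.items():
    print(f"{tool}: {false_positive_pct(called, fn):.2f}% false positives")
```

Under these assumptions the three tools differ by fewer than two percentage points in false positives, while their false-negative counts differ almost threefold.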
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
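The idea of searching for the longest prefix or suffix of a soft-clipped read that matches a target sequence can be illustrated with a toy example. The Python sketch below uses exact string matching only; it is not the Sprites implementation, which performs a proper realignment, and the function names and sequences are hypothetical.

```python
def longest_prefix_match(read, target):
    """Return (length, position) of the longest prefix of `read` that
    occurs exactly in `target`, or (0, -1) if no base matches."""
    for length in range(len(read), 0, -1):
        pos = target.find(read[:length])
        if pos != -1:
            return length, pos
    return 0, -1


def longest_suffix_match(read, target):
    """Return (length, position) of the longest suffix of `read` that
    occurs exactly in `target`, or (0, -1) if no base matches."""
    for length in range(len(read), 0, -1):
        pos = target.find(read[-length:])
        if pos != -1:
            return length, pos
    return 0, -1


# Hypothetical target region and two hypothetical soft-clipped reads.
target = "ACGTACGTTTGACCA"
print(longest_prefix_match("TTGACGG", target))  # prefix "TTGAC" matches
print(longest_suffix_match("GGACGTA", target))  # suffix "ACGTA" matches
```

A real tool would tolerate mismatches and restrict the search to a window around the clipping position; exact matching is used here only to keep the example short.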
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim of rapidly annotating millions of variants with ease and remains one of the most popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants, and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. Whether variants need to be filtered on this value depends on the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
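As an illustration of MAF-based filtration, the Python sketch below keeps only variants whose population frequency is under 1%, the cutoff mentioned above for cancer studies. The record layout, the variant identifiers, and the choice to retain variants that are absent from the database are all assumptions made for this example and do not correspond to any specific tool.

```python
def filter_rare(variants, maf_threshold=0.01):
    """Keep variants whose minor allele frequency is below the threshold.

    Variants with no database entry (maf is None) are kept, on the
    assumption that an unobserved variant is a candidate novel variant.
    """
    return [
        v for v in variants
        if v["maf"] is None or v["maf"] < maf_threshold
    ]


# Hypothetical annotated variants; "maf" would come from a population database.
variants = [
    {"id": "var1", "maf": 0.25},    # common polymorphism, filtered out
    {"id": "var2", "maf": 0.004},   # rare variant, kept
    {"id": "var3", "maf": None},    # not in the database, kept
]

rare = filter_rare(variants)
print([v["id"] for v in rare])
```

The threshold is a parameter because, as noted above, the appropriate cutoff depends on the purpose of the study.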
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 “healthy” people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based and three structure-based predictive features in order to determine if a mutation is benign, possibly damaging, or probably damaging [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation’s functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple annotations, contrasting variants that have survived natural selection with simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
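Because the predictors rarely agree, a simple way to summarize their output for one variant is to count how many of them call it deleterious. The Python sketch below does this for binary calls; the input format and the example calls are hypothetical, and in practice each predictor's score would first have to be converted to a call using that predictor's own recommended cutoff.

```python
def consensus(calls):
    """Summarize binary predictor calls for a single variant.

    `calls` maps a predictor name to True (deleterious) or False.
    Returns (number of deleterious votes, whether all predictors agree).
    """
    votes = sum(1 for is_deleterious in calls.values() if is_deleterious)
    unanimous = votes in (0, len(calls))
    return votes, unanimous


# Hypothetical calls for one missense variant.
example = {"SIFT": True, "PolyPhen-2": True, "LRT": False}
votes, unanimous = consensus(example)
print(f"{votes} of {len(example)} predictors call deleterious; unanimous={unanimous}")
```

Reporting the vote count rather than a single yes or no answer keeps the disagreement visible, which matters when the predictors rest on different assumptions.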
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. This probability is reported as a Phred-scaled somatic score. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
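Phred scaling converts a small probability into a more readable score on a logarithmic scale. The Python sketch below applies the standard Phred formula, Q = -10 log10(p), to the probability that the tumor and normal genotypes are the same, so that a higher score means stronger evidence of a somatic difference. Applying the formula to this complementary probability is our assumption about how the score is scaled, and the function names are hypothetical rather than taken from SomaticSniper.

```python
import math


def phred_scaled(p_error):
    """Standard Phred scaling: Q = -10 * log10(p_error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def somatic_score(p_different):
    """Phred-scaled score from the probability that the tumor and normal
    genotypes differ (assumes the error term is 1 - p_different)."""
    return phred_scaled(1.0 - p_different)


# A 99.9% probability that the genotypes differ gives a score near 30.
print(round(somatic_score(0.999), 2))
```

On this scale every additional 10 points corresponds to a tenfold drop in the probability that the call is wrong.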
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. In this way, somatic mutations are not only identified but also reduced to a more probable set of candidates. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
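The depth-ratio calculation described above can be written out in a few lines. The Python sketch below normalizes each sample's regional depth by its total amount of input data, takes the log2 of the tumor-to-normal ratio, and labels a segment as large-scale or focal using the 25% chromosome-arm cutoff. The function names, the use of total read counts as the normalization factor, and the treatment of a segment at exactly 25% as focal are assumptions for illustration; this is not VarScan2 code, and the segmentation step is omitted.

```python
import math


def log2_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2 of the tumor/normal depth ratio for one region, after scaling
    each depth by the total amount of input data for that sample."""
    tumor_norm = tumor_depth / tumor_total
    normal_norm = normal_depth / normal_total
    return math.log2(tumor_norm / normal_norm)


def classify_event(segment_length, arm_length, threshold=0.25):
    """Label a segment as large-scale (more than 25% of the arm) or focal."""
    if segment_length / arm_length > threshold:
        return "large-scale"
    return "focal"


# Hypothetical region: twice the normalized depth in tumor gives log2 ratio 1.
print(log2_ratio(200, 100, 1_000_000, 1_000_000))
print(classify_event(40_000_000, 100_000_000))
print(classify_event(5_000_000, 100_000_000))
```

A log2 ratio near 0 indicates no copy number change, positive values suggest gain, and negative values suggest loss, subject to tumor purity and ploidy.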
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. Absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
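The quantity that is segmented here, the absolute deviation of the B-allele frequency from 0.5, is straightforward to compute at a heterozygous site. The Python sketch below derives it from read counts supporting each allele; the function name and the read counts are hypothetical, and the sketch does not reproduce ExomeAI's site selection or segmentation.

```python
def baf_deviation(a_allele_reads, b_allele_reads):
    """Absolute deviation of the B-allele frequency from the 0.5 expected
    at a balanced heterozygous site."""
    total = a_allele_reads + b_allele_reads
    if total == 0:
        raise ValueError("site has no coverage")
    baf = b_allele_reads / total
    return abs(baf - 0.5)


# A balanced site versus a site with strong imbalance (hypothetical counts).
print(baf_deviation(50, 50))
print(round(baf_deviation(20, 80), 2))
```

Taking the absolute value means that imbalance toward either allele is treated the same way, which is what allows segments to be called without phasing information.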
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: “driver” and “passenger” mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are “by-stander” alterations that happen to occur in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the “driver” mutations from the “passenger” mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database (“positive examples”) and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types (“negative examples”). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the most popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they are mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to measure sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients, and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
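The two properties that Dendrix searches for can be made concrete with a small calculation. The Python sketch below computes, for a candidate gene set, the fraction of patients with at least one mutated gene in the set (coverage) and the fraction of those covered patients with exactly one (exclusivity). These two summary measures, the function name, and the patient data are hypothetical and serve only to illustrate the definitions; they are not the Dendrix scoring function or its search algorithms.

```python
def coverage_and_exclusivity(gene_set, mutations):
    """Summarize a candidate gene set against per-patient mutation data.

    `mutations` maps a patient ID to the set of genes mutated in that patient.
    Returns (coverage, exclusivity):
      coverage    = fraction of patients with >= 1 mutated gene in the set
      exclusivity = fraction of covered patients with exactly 1 such gene
    """
    hits = [len(gene_set & genes) for genes in mutations.values()]
    covered = sum(1 for h in hits if h >= 1)
    exclusive = sum(1 for h in hits if h == 1)
    coverage = covered / len(mutations)
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


# Hypothetical cohort of four patients and a two-gene candidate set.
patients = {
    "p1": {"GENE_A"},
    "p2": {"GENE_B"},
    "p3": {"GENE_A", "GENE_B"},
    "p4": {"GENE_C"},
}
print(coverage_and_exclusivity({"GENE_A", "GENE_B"}, patients))
```

A gene set that scores highly on both measures is mutated in most patients but seldom more than once per patient, which is the pattern described above for driver pathways.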
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
Comparing the strengths and weaknesses of these three WES pipelines, SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider sets of tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These include methods developed to generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "DbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "SomaticSniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, "Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM)," Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, "De novo discovery of mutated driver pathways in cancer," Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., "Mutational heterogeneity in cancer and the search for new cancer-associated genes," Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, "BEReX: biomedical entity-relationship eXplorer," Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., "Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology," PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci," Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., "Pharmacogenomics knowledge for personalized medicine," Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., "DrugBank 4.0: shedding new light on drug metabolism," Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., "DSigDB: drug signatures database for gene set analysis," Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., "The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data," Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, "SeqMule: automated pipeline for analysis of human exome/genome sequencing data," Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, "Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses," BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., "IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples," Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., "A research roadmap for next-generation sequencing informatics," Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| Significant somatic mutations | | | |
| SomaticSniper | Using two BAM files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants | http://gmt.genome.wustl.edu/packages/somatic-sniper/ | [50] |
| MuTect | Using statistical analysis to predict the likelihood of a somatic mutation using two Bayesian approaches | https://www.broadinstitute.org/cancer/cga/mutect | [35] |
| VarSim | By leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations | http://bioinform.github.io/varsim/ | [51] |
| SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python | https://github.com/kylessmith/SomVarIUS | [52] |
| Copy number alteration | | | |
| Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency | http://bioinfo-out.curie.fr/projects/freec/ | [59] |
| CNV-seq | Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies | http://tiger.dbs.nus.edu.sg/cnv-seq/ | [53] |
| SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data | https://www.broadinstitute.org/cancer/cga/segseq | [54] |
| VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, then postprocessed with a circular binary segmentation algorithm | http://dkoboldt.github.io/varscan/using-varscan.html | [61] |
| ExomeAI | Detects allelic imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets | http://gqinnovationcenter.com/index.aspx | [64] |
| CNVseeqer | Exon coverage between matched sequences was calculated using log2 ratios followed by the circular binary segmentation algorithm | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60] |
| EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm | https://sourceforge.net/projects/excavatortool/ | [57] |
| ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58] |
| ADTEx | Detection of aberrations in tumor exomes by detecting B-allele frequencies; implemented in R | http://adtex.sourceforge.net/ | [55] |
| CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES | https://sourceforge.net/projects/contra-cnv/ | [56] |
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| Driver prediction tools | | | |
| CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65] |
| Dendrix | De novo drivers are discovered from cancer-only mutational data, including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix/ | [66] |
| MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67] |
| Pathway analysis tools and resources | | | |
| KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg/ | [68] |
| DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list, including pathways, gene ontology terms, and more | https://david.ncifcrf.gov/ | [69] |
| STRING | Network visualization of protein-protein interactions of over 2,031 organisms | http://string-db.org/ | [70] |
| BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex/ | [71] |
| DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72] |
| SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea/ | [73] |
| Tools and resources for linking variants to therapeutics | | | |
| cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies, including providing patient and clinical data for samples | http://www.cbioportal.org/ | [78] |
| My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org/ | http://www.mycancergenome.org/ |
| ClinVar | Database of relationships between phenotypes and human variations, showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar/ | [74] |
| DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds and that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77] |
| PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org/ | [75] |
| DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca/ | [76] |
Table 1: Continued.

| Computational tools | Description | Website | References |
|---|---|---|---|
| WES data analysis pipelines | | | |
| Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file, with good capability for novice and expert users | http://fastq2vcf.sourceforge.net/ | [80] |
| SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest/ | [79] |
| IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81] |
| Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud | |
structural variants with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL-detecting software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were made between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs (6,491), followed by Pindel (7,239) and indelMINER (7,365). However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER had the lowest number of false negatives, with 398 compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
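The counts quoted above from [32] can be turned into precision, recall, and F-score. The sketch below is illustrative arithmetic only: it assumes that true positives equal the 7,500 simulated INDELs minus the reported false negatives, and that the false-positive percentage is the share of called INDELs that are false. The function and variable names are ours and are not part of any of the tools.

```python
# Illustrative arithmetic on the figures quoted from [32]; all names are hypothetical.
SIMULATED_INDELS = 7500

# tool -> (INDELs called, false negatives)
reported = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}


def summarize(called, false_negatives, truth=SIMULATED_INDELS):
    """Return (precision, recall, F-score) from call counts."""
    # Assumption: every simulated INDEL that is not a false negative is a true positive.
    true_positives = truth - false_negatives
    precision = true_positives / called
    recall = true_positives / truth
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


metrics = {tool: summarize(*counts) for tool, counts in reported.items()}

for tool, (precision, recall, f_score) in metrics.items():
    fp_percent = 100 * (1 - precision)
    print(f"{tool}: FP {fp_percent:.2f}%, recall {recall:.3f}, F-score {f_score:.3f}")
```

Under these assumptions the recomputed false-positive percentages reproduce the values quoted in the text (2.65%, 4.53%, and 3.57%), and the F-score ranks indelMINER first, Pindel second, and SAMtools third.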
Most of the recent SV detection tools rely on realigning split reads to detect deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
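The idea of finding the longest prefix or suffix of a soft-clipped segment that matches a target sequence can be illustrated with a toy example. This is not the Sprites implementation: it assumes exact string matching, whereas a real realigner would have to tolerate mismatches and work on indexed references. The function names and the sequences are hypothetical.

```python
def longest_matching_prefix(clipped, target):
    """Return (length, position) of the longest prefix of `clipped` found in `target`."""
    for length in range(len(clipped), 0, -1):
        position = target.find(clipped[:length])  # exact match only (simplifying assumption)
        if position != -1:
            return length, position
    return 0, -1


def longest_matching_suffix(clipped, target):
    """Return (length, position) of the longest suffix of `clipped` found in `target`."""
    for length in range(len(clipped), 0, -1):
        position = target.find(clipped[-length:])
        if position != -1:
            return length, position
    return 0, -1


target = "ACGTTGCAAGGCTTAC"   # hypothetical target region
clipped = "AAGGCTGG"          # hypothetical soft-clipped segment

print(longest_matching_prefix(clipped, target))  # (6, 7): "AAGGCT" matches at offset 7
print(longest_matching_suffix(clipped, target))  # (2, 9): "GG" matches at offset 9
```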
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim of rapidly annotating millions of variants with ease, and it remains one of the most popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants, and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number depends on the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
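The MAF-based filtering described in this subsection can be sketched in a few lines. The example is a minimal illustration, not the interface of any annotation tool: the record layout, the variant identifiers, and the choice to keep variants that are absent from the population database are all assumptions made for the example; only the 1% threshold comes from the text.

```python
# Hypothetical annotated variants; "maf" is the minor allele frequency from a
# population database, or None when the variant is absent from that database.
variants = [
    {"id": "var1", "maf": 0.25},
    {"id": "var2", "maf": 0.004},
    {"id": "var3", "maf": None},
]


def keep_rare(records, max_maf=0.01):
    """Keep variants that are unobserved or rarer than `max_maf` (1% by default)."""
    return [r for r in records if r["maf"] is None or r["maf"] < max_maf]


rare = keep_rare(variants)
print([r["id"] for r in rare])  # ['var2', 'var3']
```

A study interested in common polymorphisms would invert the comparison or raise the threshold, which is why the cutoff is exposed as a parameter.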
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1,000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies, drawn from almost 20,000 publications, allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing whether a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed, all of which vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors, including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines whether a variant is deleterious by using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods to determine whether a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models to predict the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations to predict deleterious variants [47]. MetaSVM and MetaLR are based on an ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict the functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple annotations into a single score, contrasting variants that have survived natural selection with simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor, as well as the focus of the study, when interpreting deleteriousness prediction results.
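The lack of unanimous agreement among predictors can be made concrete with a simple vote tally. The sketch below uses invented calls for three hypothetical variants; it does not reproduce the scoring of SIFT, PolyPhen-2, or LRT, and it only shows how per-variant agreement might be summarized once each tool's call has been reduced to a deleterious or not-deleterious label.

```python
# Hypothetical calls: True means the predictor labels the variant deleterious.
calls = {
    "varA": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "varB": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "varC": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
}


def tally(predictions):
    """Return (deleterious votes, number of predictors, unanimously deleterious?)."""
    votes = sum(predictions.values())
    total = len(predictions)
    return votes, total, votes == total


summary = {variant: tally(p) for variant, p in calls.items()}
flagged_by_any = [v for v, (votes, _, _) in summary.items() if votes > 0]
flagged_by_all = [v for v, (_, _, unanimous) in summary.items() if unanimous]

print(f"{len(flagged_by_all)} of {len(flagged_by_any)} flagged variants are unanimous")
```

Whether to require unanimity, a majority, or a single call is a study-level decision, which is the point the paragraph above makes.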
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
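Phred scaling, mentioned above for the somatic score, can be sketched as follows. The example assumes the usual Phred convention, in which the score is -10 log10 of the probability that the call is wrong (here, the probability that the tumor and normal genotypes are in fact the same). The cap of 255 and the function name are assumptions for illustration and are not taken from the SomaticSniper source.

```python
import math


def somatic_score(p_different, cap=255):
    """Phred-scale the probability that tumor and normal genotypes are the same.

    `cap` is an assumed upper bound so that a probability of exactly 1.0
    does not produce an infinite score.
    """
    p_same = 1.0 - p_different
    if p_same <= 0.0:
        return cap
    return min(cap, round(-10 * math.log10(p_same)))


print(somatic_score(0.9))    # 10: a 10% chance the genotypes are the same
print(somatic_score(0.999))  # 30: a 0.1% chance the genotypes are the same
```

Each additional 10 points on this scale corresponds to a tenfold drop in the probability that the site is not somatic.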
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine whether a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples, as well as the dbSNP database, is used for comparison to remove common polymorphisms. In this way, somatic mutations are not only identified but also reduced to a more probable set of candidates. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of a chromosome arm) or focal (<25%) events in the WES data [63].
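The depth-ratio step described above can be written as a short worked example. This is a simplified sketch rather than VarScan2's code: it assumes that "amount of input data" can be represented by a single per-sample total (for instance, total sequenced bases), it omits segmentation entirely, and the function names and numbers are hypothetical. The 25% threshold follows the classification quoted from [63].

```python
import math


def log2_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2(tumor / normal) after scaling each depth by its sample's total input."""
    normalized_tumor = tumor_depth / tumor_total
    normalized_normal = normal_depth / normal_total
    return math.log2(normalized_tumor / normalized_normal)


def classify_event(segment_length, arm_length):
    """Label a segment large-scale if it covers more than 25% of the arm, else focal."""
    # A segment covering exactly 25% is treated as focal here; the text leaves this open.
    return "large-scale" if segment_length / arm_length > 0.25 else "focal"


# Region with 300x tumor depth and 100x normal depth; the tumor library is 1.5x larger.
ratio = log2_ratio(300, 100, tumor_total=1.5e9, normal_total=1.0e9)
print(round(ratio, 3))                # 1.0, i.e. twice the normalized depth
print(classify_event(40e6, 100e6))    # large-scale
print(classify_event(5e6, 100e6))     # focal
```

Without the normalization, the same region would give log2(3), which would overstate the gain simply because the tumor sample was sequenced more deeply.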
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between the A- and B-alleles in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
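The quantity described above, the absolute deviation of the B-allele frequency from 0.5, is straightforward to compute at a single heterozygous site. The sketch assumes the B-allele frequency is the fraction of reads supporting the B allele; the function name and read counts are hypothetical, and the segmentation and normal-sample filtering performed by ExomeAI are not shown.

```python
def baf_deviation(a_reads, b_reads):
    """Absolute deviation of the B-allele frequency from 0.5 at a heterozygous site."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return abs(b_reads / total - 0.5)


print(baf_deviation(50, 50))            # 0.0, the balanced 1:1 case
print(round(baf_deviation(20, 80), 2))  # 0.3, a strong imbalance
```

Taking the absolute value means that an excess of either allele gives the same signal, so the measure does not depend on which allele is labeled B.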
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be present in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses a random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they are mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of these available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
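The gene-size effect described above can be illustrated with a deliberately simplified model. The sketch below is not MutSigCV's method: it assumes a single uniform background mutation rate and a Poisson model, and the gene lengths, rate, and cohort size are hypothetical values chosen only to show the trend.

```python
import math

def prob_at_least(k, gene_length_bp, rate_per_bp, n_patients):
    """P(at least k chance mutations in a gene) under a Poisson model."""
    # Expected number of background mutations across the whole cohort.
    lam = gene_length_bp * rate_per_bp * n_patients
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

# Hypothetical cohort of 100 patients and a uniform rate of 1e-6 per base.
long_gene = prob_at_least(5, gene_length_bp=100_000, rate_per_bp=1e-6, n_patients=100)
short_gene = prob_at_least(5, gene_length_bp=1_000, rate_per_bp=1e-6, n_patients=100)
print(f"long gene: {long_gene:.3f}, short gene: {short_gene:.2e}")
```

Under these assumptions the long gene is very likely to carry five or more mutations by chance alone, whereas the short gene is not, which is why a fixed mutation-count cutoff favors large genes unless the background is modeled per gene.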
De novo Driver Exclusivity (Dendrix) is a computational tool to discover driver pathways (gene sets) de novo from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes that most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations rarely occur together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
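The coverage and exclusivity criteria can be combined into a single score and optimized greedily, as sketched below. This is a simplified illustration rather than the Dendrix code: the weight used here (patients covered minus the coverage overlap) and the pair-then-extend greedy strategy are assumptions for the example, and the mutation data are hypothetical.

```python
from itertools import combinations

def weight(gene_set, mutations):
    """Coverage minus coverage overlap for a set of genes."""
    covered = set().union(*(mutations[g] for g in gene_set))
    # Overlap counts every extra time a patient is hit by another gene in the set.
    overlap = sum(len(mutations[g]) for g in gene_set) - len(covered)
    return len(covered) - overlap

def greedy_gene_set(mutations, k):
    """Start from the best pair, then add the gene that most improves the weight."""
    genes = sorted(mutations)
    chosen = list(max(combinations(genes, 2), key=lambda pair: weight(pair, mutations)))
    while len(chosen) < k:
        rest = [g for g in genes if g not in chosen]
        chosen.append(max(rest, key=lambda g: weight(chosen + [g], mutations)))
    return chosen

# Hypothetical data: gene -> set of patients carrying a mutation in that gene.
mutations = {
    "G1": {"p1", "p2", "p3"},
    "G2": {"p4", "p5"},
    "G3": {"p6"},
    "G4": {"p1", "p4"},  # co-occurs with G1 and G2, so it is penalized
}
best = greedy_gene_set(mutations, 3)
print(best, weight(best, mutations))
```

In this toy cohort the three mutually exclusive genes are selected, and the gene whose mutations co-occur with others is left out.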
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
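Functional enrichment of a gene list against a pathway is commonly assessed with an over-representation test. The sketch below uses a plain one-sided hypergeometric test as a generic illustration; it is not the exact statistic implemented by DAVID, STRING, or BEReX, and all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the hypergeometric probabilities for every overlap at least as large.
    tail = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical case: 20,000 genes in total, a 100-gene pathway, and a list of
# 50 mutated genes of which 5 fall in the pathway.
print(enrichment_p_value(20_000, 100, 50, 5))
```

With an expected overlap of only 0.25 genes, observing 5 pathway genes gives a very small p-value; in a real analysis the values would also need correction for the number of pathways tested.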
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms that are only detectable through protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome (https://www.mycancergenome.org/) have provided the framework for these studies. My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate the computational tools and methods described in the previous sections into a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as the alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
Comparing the strengths and weaknesses of these three WES pipelines, SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, users should consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
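The comparison above can also be kept as a small lookup structure when deciding between pipelines. The sketch below encodes only the tools named in this section; the dictionary layout and the helper function are hypothetical conveniences, and the lists should not be read as complete feature inventories of each pipeline.

```python
# Tools per pipeline as named in this section [79-81]; not exhaustive.
PIPELINES = {
    "SeqMule": {
        "aligners": ["BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"],
        "callers": ["GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"],
    },
    "Fastq2vcf": {
        "aligners": ["BWA"],
        "callers": ["GATK", "SAMtools", "SNVer"],
        "annotators": ["ANNOVAR", "VEP"],
    },
    "IMPACT": {
        "aligners": ["BWA"],
        "callers": ["SAMtools"],
        "annotators": ["ANNOVAR", "SIFT", "PolyPhen-2"],
    },
}

def pipelines_using(tool):
    """Names of the pipelines that list `tool` in any stage."""
    return sorted(
        name for name, stages in PIPELINES.items()
        if any(tool in tools for tools in stages.values())
    )

print(pipelines_using("SAMtools"))
print(pipelines_using("GATK"))
```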
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods include tools developed to generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 11.10, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "DbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "Somaticsniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, "Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM)," Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, "De novo discovery of mutated driver pathways in cancer," Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., "Mutational heterogeneity in cancer and the search for new cancer-associated genes," Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, "BEReX: biomedical entity-relationship eXplorer," Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., "Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology," PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci," Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., "Pharmacogenomics knowledge for personalized medicine," Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., "DrugBank 4.0: shedding new light on drug metabolism," Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., "DSigDB: drug signatures database for gene set analysis," Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., "The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data," Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, "SeqMule: automated pipeline for analysis of human exome/genome sequencing data," Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, "Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses," BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., "IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples," Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., "A research roadmap for next-generation sequencing informatics," Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1: Continued.
Computational tools | Description | Website | References
Driver prediction tools
CHASM | Machine learning method that predicts the functional significance of somatic mutations | http://karchinlab.org/apps/appChasm.html | [65]
Dendrix | De novo drivers are discovered from cancer-only mutational data, including genes, nucleotides, or domains that have high exclusivity and coverage | http://compbio.cs.brown.edu/projects/dendrix/ | [66]
MutSigCV | Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67]
Pathway analysis tools and resources
KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results | http://www.genome.jp/kegg/ | [68]
DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list, including pathways, gene ontology terms, and more | https://david.ncifcrf.gov/ | [69]
STRING | Network visualization of protein-protein interactions of over 2031 organisms | http://string-db.org/ | [70]
BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities | http://infos.korea.ac.kr/berex/ | [71]
DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72]
SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data | http://www.broadinstitute.org/mpg/snpsea/ | [73]
Tools and resources for linking variants to therapeutics
cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies, including providing patient and clinical data for samples | http://www.cbioportal.org/ | [78]
My Cancer Genome | Database for cancer research that provides linkage of mutational status to therapies and available clinical trials | https://www.mycancergenome.org/ | http://www.mycancergenome.org/
ClinVar | Database of relationships between phenotypes and human variations, showing the relationship between health status and human variations and known implications | https://www.ncbi.nlm.nih.gov/clinvar/ | [74]
DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research | http://tanlab.ucdenver.edu/DSigDB | [77]
PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge | https://www.pharmgkb.org/ | [75]
DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs | http://www.drugbank.ca/ | [76]
Table 1: Continued.
Computational tools | Description | Website | References
WES data analysis pipelines
Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file, and that has good capability for novice and expert users | http://fastq2vcf.sourceforge.net/ | [80]
SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest/ | [79]
IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links these data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81]
Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control and that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud |
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL-detecting software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs with 6,491, followed by Pindel with 7,239 and indelMINER with 7,365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER had the lowest number of false negatives with 398, compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
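The counts quoted for this comparison can be cross-checked with a few lines of arithmetic. The sketch below assumes that every false negative is one of the 7,500 simulated INDELs that a tool missed, so that true positives equal 7,500 minus the false negatives and false positives equal the calls minus the true positives. This bookkeeping is an assumption made for illustration and is not a procedure quoted from [32].

```python
# Illustrative cross-check of the figures quoted in the text.
SIMULATED_INDELS = 7500

# tool -> (INDELs called, false negatives), as quoted in the text
reported = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}


def false_positive_percentage(called: int, false_negatives: int,
                              total_true: int = SIMULATED_INDELS) -> float:
    """Percentage of calls that are false positives (assumed bookkeeping)."""
    true_positives = total_true - false_negatives
    false_positives = called - true_positives
    return 100.0 * false_positives / called


fp_percent = {
    tool: round(false_positive_percentage(called, fn), 2)
    for tool, (called, fn) in reported.items()
}

for tool, pct in fp_percent.items():
    print(f"{tool}: {pct:.2f}% of calls are false positives")
```

Under that assumption the derived percentages match the ones reported for the three tools.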
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
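The idea of searching for the longest suffix of a soft-clipped segment that matches the target sequence can be illustrated with a toy example. The function below is a simplified sketch that uses exact string matching. The function name, the sequences, and the exact-match simplification are all assumptions made here; the actual Sprites implementation described in [33] is not reproduced.

```python
def longest_matching_suffix(clipped: str, target: str) -> tuple[str, int]:
    """Return the longest suffix of `clipped` that occurs in `target`,
    together with the position of its first occurrence.

    Returns ("", -1) when not even the last base of `clipped` matches.
    """
    # Try the full clipped segment first, then progressively shorter suffixes.
    for start in range(len(clipped)):
        suffix = clipped[start:]
        position = target.find(suffix)
        if position != -1:
            return suffix, position
    return "", -1


# Hypothetical sequences, invented for illustration.
target_sequence = "ACGTTTGACCAGT"
soft_clipped = "GGGACCAG"

print(longest_matching_suffix(soft_clipped, target_sequence))
```

The longest prefix can be found the same way by shortening the segment from its other end.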
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim of rapidly annotating millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants, and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number depends on the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information, including disease association, genotype origin, and somatic and germline variant information [39].
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 "healthy" people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
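Filtering on population frequency can be sketched in a few lines. The example below keeps variants whose MAF is below a chosen threshold, or that are absent from the frequency database altogether. The `Variant` record, the field names, the sample values, and the default threshold are hypothetical and chosen only for illustration; real pipelines read these values from annotated VCF files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    population_maf: Optional[float]  # None = not reported in the database


def keep_rare(variants: list[Variant], max_maf: float = 0.01) -> list[Variant]:
    """Keep variants that are unreported or rarer than max_maf."""
    return [
        v for v in variants
        if v.population_maf is None or v.population_maf < max_maf
    ]


# Hypothetical records, invented for illustration only.
variants = [
    Variant("chr1", 100, "A", "G", 0.30),   # common polymorphism
    Variant("chr1", 200, "C", "T", 0.004),  # rare variant
    Variant("chr2", 300, "G", "A", None),   # absent from the database
]

rare = keep_rare(variants)
print([(v.chrom, v.pos) for v in rare])
```

Whether unreported variants should be kept, and which threshold is appropriate, depends on the purpose of the study.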
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors, including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis
by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
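How often several predictors agree can be measured directly once each variant carries one call per predictor. The sketch below computes the fraction of variants flagged by at least one predictor that are flagged by all of them. The variant identifiers and the calls are invented for illustration and do not come from any of the cited studies.

```python
def unanimous_fraction(calls: dict[str, dict[str, bool]]) -> float:
    """Fraction of variants flagged deleterious by at least one predictor
    that are flagged deleterious by every predictor."""
    flagged = [votes for votes in calls.values() if any(votes.values())]
    if not flagged:
        return 0.0
    unanimous = [votes for votes in flagged if all(votes.values())]
    return len(unanimous) / len(flagged)


# Hypothetical calls (True = predicted deleterious), invented for illustration.
toy_calls = {
    "var1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "var2": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "var3": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
    "var4": {"SIFT": False, "PolyPhen-2": False, "LRT": False},
}

agreement = unanimous_fraction(toy_calls)
print(f"{agreement:.2f} of flagged variants are unanimous")
```

A low value on real data would be one reason to report the individual predictor calls rather than a single combined label.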
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and hence predicted as somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
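Phred scaling converts a probability of being wrong, P, into a score Q = -10 log10(P), so that higher scores mean higher confidence. The sketch below applies that standard transformation to the complement of the probability that the tumor and normal genotypes differ. Treating the somatic score this way, and capping it at 255, are assumptions made for illustration; the exact computation used by SomaticSniper is not given in this review.

```python
import math


def phred_scale(error_probability: float, cap: int = 255) -> int:
    """Convert a probability of being wrong into a Phred-scaled score."""
    if error_probability <= 0.0:
        return cap  # assumed cap to avoid an infinite score
    return min(cap, round(-10.0 * math.log10(error_probability)))


def somatic_score(p_genotypes_differ: float) -> int:
    """Assumed form: Phred scaling of the probability that the tumor and
    normal genotypes are in fact the same."""
    return phred_scale(1.0 - p_genotypes_differ)


for p in (0.9, 0.99, 0.999):
    print(f"P(different) = {p}: somatic score = {somatic_score(p)}")
```

On this scale, each additional 10 points corresponds to a tenfold drop in the probability that the call is wrong.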
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, somatic mutations are not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes
a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (greater than 25% of a chromosome arm) or focal (less than 25%) events in the WES data [63].
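The depth-ratio calculation and the size-based classification can be sketched as follows. The normalization by total input data, the function names, and the handling of a segment that covers exactly 25% of an arm are assumptions made for illustration; this is not VarScan2's code, and the segmentation step (CBS) is omitted.

```python
import math


def log2_depth_ratio(tumor_depth: float, normal_depth: float,
                     tumor_total: float, normal_total: float) -> float:
    """log2 of the tumor/normal depth ratio for one region, after scaling
    each depth by the total amount of data sequenced for its sample."""
    ratio = (tumor_depth * normal_total) / (normal_depth * tumor_total)
    return math.log2(ratio)


def classify_event(segment_length: float, arm_length: float,
                   threshold: float = 0.25) -> str:
    """Label a segment by the fraction of the chromosome arm it spans."""
    fraction = segment_length / arm_length
    return "large-scale" if fraction > threshold else "focal"


# Hypothetical numbers, invented for illustration.
print(log2_depth_ratio(200, 100, 1_000_000, 1_000_000))  # twice the depth
print(log2_depth_ratio(50, 100, 1_000_000, 1_000_000))   # half the depth
print(classify_event(40_000_000, 100_000_000))
print(classify_event(5_000_000, 100_000_000))
```

A positive log2 ratio suggests a gain and a negative one a loss; in practice, thresholds on the ratio are applied after segmentation.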
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
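The quantity that is segmented here, the absolute deviation of the B-allele frequency from 0.5, is simple to compute from allele read counts at a heterozygous site. The sketch below uses hypothetical read counts and a hypothetical function name; it illustrates the definition only and is not ExomeAI's implementation.

```python
def baf_deviation(a_allele_reads: int, b_allele_reads: int) -> float:
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_allele_reads + b_allele_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    b_allele_frequency = b_allele_reads / total
    return abs(b_allele_frequency - 0.5)


# Hypothetical read counts at three heterozygous sites.
for a_reads, b_reads in [(50, 50), (80, 20), (10, 90)]:
    print(a_reads, b_reads, round(baf_deviation(a_reads, b_reads), 3))
```

A balanced site gives a deviation near 0, while a site where one allele dominates approaches the maximum of 0.5.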
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the size of cancer genome sequencing datasets has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
De novo Driver Exclusivity (Dendrix) is a computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients, and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
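The trade-off between coverage and exclusivity can be made concrete with a small scoring function. The sketch below rewards a gene set for every patient it covers and penalizes it for every extra mutated gene within the same patient. This particular score (coverage minus overlap), the function names, and the patient data are assumptions chosen for illustration; this review does not state the weight that Dendrix optimizes, so [66] should be consulted for the actual definition.

```python
def covered_patients(gene_set: set[str],
                     mutations: dict[str, set[str]]) -> set[str]:
    """Patients with at least one mutated gene from gene_set."""
    return {patient for patient, genes in mutations.items() if genes & gene_set}


def coverage_overlap(gene_set: set[str],
                     mutations: dict[str, set[str]]) -> int:
    """Number of mutated genes beyond the first, summed over covered patients."""
    return sum(len(genes & gene_set) - 1
               for genes in mutations.values() if genes & gene_set)


def exclusivity_score(gene_set: set[str],
                      mutations: dict[str, set[str]]) -> int:
    """Assumed score: high coverage is rewarded, co-occurrence is penalized."""
    return (len(covered_patients(gene_set, mutations))
            - coverage_overlap(gene_set, mutations))


# Hypothetical patients and mutated genes, invented for illustration.
toy_mutations = {
    "P1": {"KRAS"},
    "P2": {"EGFR"},
    "P3": {"KRAS", "TP53"},
    "P4": {"TP53"},
    "P5": set(),
}

print(exclusivity_score({"KRAS", "EGFR"}, toy_mutations))
print(exclusivity_score({"KRAS", "TP53"}, toy_mutations))
```

Both toy gene sets cover three patients, but the second is penalized because two of its genes are mutated together in one patient.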
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
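Functional enrichment analysis asks whether a user-defined gene list overlaps a pathway more than expected by chance. Each of the tools above implements its own statistics, which are not detailed in this review; the sketch below uses a one-sided hypergeometric test, a common textbook choice, purely as an illustrative assumption. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, pathway: int,
                           selected: int, overlap: int) -> float:
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    upper = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 genes, a 100-gene pathway,
# a 50-gene mutated list, 5 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20000, 100, 50, 5)
print(f"enrichment p-value: {p_value:.3g}")
```

When many pathways are tested, the resulting p-values would normally be corrected for multiple testing.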
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the
genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs to select the appropriate WES data analysis pipeline for their research.
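The comparison above can be captured in a small lookup structure that helps narrow the choice of pipeline. The dictionary below records only what this section states about each pipeline. The field names, the feature labels, and the selection function are hypothetical, and the structure is a reading aid rather than a description of any pipeline's configuration format.

```python
# Feature summary taken from the comparison in the text; field names are
# hypothetical and features not mentioned in the text are left empty.
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {"GATK UnifiedGenotyper", "HaplotypeCaller",
                    "SAMtools", "SNVer"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "post_vcf": {"copy number", "tumor heterogeneity", "drug linking"},
        "tumor_only": True,
    },
}


def suitable_pipelines(required_post_vcf: set[str],
                       is_tumor_sample: bool) -> list[str]:
    """Pipelines that cover the required Post-VCF analyses and are
    applicable to the sample type."""
    names = []
    for name, spec in PIPELINES.items():
        if spec["tumor_only"] and not is_tumor_sample:
            continue
        if required_post_vcf.issubset(spec["post_vcf"]):
            names.append(name)
    return sorted(names)


print(suitable_pipelines(set(), is_tumor_sample=False))
print(suitable_pipelines({"copy number"}, is_tumor_sample=True))
```

Such a table only encodes the criteria discussed here; installation effort, compute environment, and reporting needs would also weigh on a real decision.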
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods include tools developed to generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance
precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigo, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “dbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “SomaticSniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Table 1: Continued.
Computational tools | Description | Website | References
WES data analysis pipelines
Fastq2vcf | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file that has good capability for novice and expert users | http://fastq2vcf.sourceforge.net/ | [80]
SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies | http://seqmule.openbioinformatics.org/en/latest/ | [79]
IMPACT | WES data analysis pipeline that starts with raw sequencing reads and analyzes SNVs and CNAs and links this data to a list of prioritized drugs from clinical trials and DSigDB | http://tanlab.ucdenver.edu/IMPACT | [81]
Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs in part alignment, variant calling, and quality control that can be run on Amazon Web Services EC2 as well as local machines and clusters | http://genome.sph.umich.edu/wiki/GotCloud
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL-detecting software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7500 INDELs [32]. SAMtools found the fewest INDELs with 6491, followed by Pindel with 7239 and indelMINER with 7365. However, indelMINER’s false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER had the lowest number of false negatives with 398, compared to 589 and 1181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, future tools developed for SV detection can be expected to take an approach similar to indelMINER by incorporating the best methods that have been developed thus far.
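The reported false-positive percentages can be recovered from the call counts and false negatives quoted above. The following is a minimal sketch, assuming the false-positive percentage is the share of a tool's calls that are not among the 7500 simulated INDELs (an assumption about the metric's definition, chosen because it reproduces the quoted figures); the function and variable names are illustrative.

```python
# Counts quoted in the comparison on the simulated set of 7500 INDELs.
TOTAL_TRUE_INDELS = 7500

# tool -> (INDELs called, false negatives)
reported = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}


def false_positive_percentage(calls, false_negatives, total_true=TOTAL_TRUE_INDELS):
    """Percentage of a tool's calls that are not true INDELs (assumed definition)."""
    true_positives = total_true - false_negatives
    false_positives = calls - true_positives
    return 100.0 * false_positives / calls


fp_rates = {
    tool: round(false_positive_percentage(calls, fn), 2)
    for tool, (calls, fn) in reported.items()
}
# fp_rates: SAMtools 2.65, Pindel 4.53, indelMINER 3.57
```

Under this definition the three percentages in the text follow directly from the counts, which is a useful consistency check when comparing tools on a simulated truth set.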
Most of the recent SV detection tools rely on realigning split reads to detect deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
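The idea of searching for the longest prefix or suffix of a soft-clipped segment that matches the target sequence can be sketched as follows. This is a simplified illustration that assumes exact string matching; it is not the Sprites implementation, which performs a realignment, and the sequences and function names are hypothetical.

```python
def longest_matching_suffix(clipped, target):
    """Longest suffix of `clipped` that occurs exactly in `target`.

    Returns (suffix, position in target), or ("", -1) if nothing matches.
    """
    for start in range(len(clipped)):
        suffix = clipped[start:]
        position = target.find(suffix)
        if position != -1:
            return suffix, position
    return "", -1


def longest_matching_prefix(clipped, target):
    """Longest prefix of `clipped` that occurs exactly in `target`."""
    for end in range(len(clipped), 0, -1):
        prefix = clipped[:end]
        position = target.find(prefix)
        if position != -1:
            return prefix, position
    return "", -1


# Hypothetical target region and soft-clipped segment.
target_region = "ACGTTGCAAGGCT"
clipped_segment = "TTTGCAAG"
suffix_hit = longest_matching_suffix(clipped_segment, target_region)
prefix_hit = longest_matching_prefix(clipped_segment, target_region)
```

Here the full segment does not occur in the target, but its 7-base suffix does, which is the kind of partial match that lets a clipped read be anchored next to a deletion breakpoint.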
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim of rapidly annotating millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need to filter variants based on this number depends on the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
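A MAF-based filter of the kind described above can be sketched in a few lines. This is a minimal illustration that assumes each variant has already been annotated with a population MAF (or `None` when no database reports it); the variant records and the function name are hypothetical and not tied to any particular annotation tool.

```python
RARE_MAF_CUTOFF = 0.01  # the 1% threshold mentioned in the text

# Hypothetical annotated variants; "maf" is None when the variant is unreported.
variants = [
    {"id": "var1", "maf": 0.25},
    {"id": "var2", "maf": 0.004},
    {"id": "var3", "maf": None},
]


def keep_rare(variant_list, cutoff=RARE_MAF_CUTOFF):
    """Keep variants that are unreported or rarer than the cutoff."""
    return [v for v in variant_list if v["maf"] is None or v["maf"] < cutoff]


rare_ids = [v["id"] for v in keep_rare(variants)]
```

Whether unreported variants should be kept, and which cutoff to use, depends on the purpose of the study, so both are exposed here as explicit choices rather than fixed rules.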
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1000 “healthy” people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing whether a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation’s functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict the functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple annotations and contrasting variants that have survived natural selection with simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
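The degree of agreement between predictors can be quantified with a simple tally. The sketch below assumes each predictor's output has already been reduced to a deleterious or not-deleterious call per variant; the calls shown are invented for illustration and the function name is hypothetical.

```python
# Hypothetical calls: True means the predictor flags the variant as deleterious.
calls = {
    "var1": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "var2": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
    "var3": {"SIFT": False, "PolyPhen-2": True, "LRT": True},
    "var4": {"SIFT": False, "PolyPhen-2": False, "LRT": False},
}


def agreement_summary(predictor_calls):
    """Return variants flagged by any predictor, by all predictors, and the ratio."""
    flagged_by_any = [v for v, c in predictor_calls.items() if any(c.values())]
    flagged_by_all = [v for v, c in predictor_calls.items() if all(c.values())]
    fraction = len(flagged_by_all) / len(flagged_by_any) if flagged_by_any else 0.0
    return flagged_by_any, flagged_by_all, fraction


any_flagged, all_flagged, unanimous_fraction = agreement_summary(calls)
```

Reporting both the union and the unanimous subset makes explicit how much a result depends on the choice of predictor, which is the practical concern raised above.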
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and are hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
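Phred scaling maps a probability p to -10·log10(p), so that small probabilities become large scores. The sketch below assumes that the quantity being Phred-scaled is the probability that the tumor and normal genotypes are the same (that is, one minus the probability that they differ), and that the score is rounded and capped at 255; both are assumptions made for illustration rather than details given in the text.

```python
import math

MAX_SCORE = 255  # assumed cap for the reported score


def phred_scale(probability, max_score=MAX_SCORE):
    """Convert a probability into a rounded, capped Phred-scaled score."""
    if probability <= 0.0:
        return max_score
    return min(max_score, round(-10.0 * math.log10(probability)))


def somatic_score(prob_genotypes_differ):
    """Phred-scale the probability that tumor and normal genotypes are the same."""
    return phred_scale(1.0 - prob_genotypes_differ)


confident_call = somatic_score(0.999)  # genotypes very likely differ
uncertain_call = somatic_score(0.5)    # no better than a coin flip
```

With this convention, a higher score means stronger evidence that the site is somatic, and a score near zero means the tumor and normal genotypes are probably identical.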
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, somatic mutations are not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
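Precision and recall, and the F-score that combines them, are computed from the overlap between a tool's calls and a reference call set. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the SomVarIUS evaluation.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score) from confusion counts."""
    called = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical counts: 80 calls match the reference, 20 do not, 40 are missed.
precision, recall, f_score = precision_recall_f(80, 20, 40)
```

Because the F-score is the harmonic mean of precision and recall, a tool cannot score well by maximizing one at the expense of the other.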
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
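The depth-ratio calculation and the large-scale versus focal classification can be sketched as follows. This is an illustration under stated assumptions, not VarScan2's code: the "amount of input data" is taken to be the total read count of each sample, events covering exactly 25% of an arm are treated as focal, and all names and numbers are hypothetical.

```python
import math


def log2_depth_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2(tumor/normal) depth for a region after equalizing sample yield."""
    # Rescale tumor depth so both samples are compared at the same total yield.
    scaled_tumor_depth = tumor_depth * (normal_total / tumor_total)
    return math.log2(scaled_tumor_depth / normal_depth)


def classify_event(segment_length, arm_length, threshold=0.25):
    """Label a segment by the fraction of the chromosome arm it covers."""
    return "large-scale" if segment_length / arm_length > threshold else "focal"


# Same sequencing yield, tumor depth doubled: one extra log2 unit.
gain = log2_depth_ratio(200, 100, 1_000_000, 1_000_000)
# Tumor sequenced twice as deeply overall: the apparent gain disappears.
neutral = log2_depth_ratio(200, 100, 2_000_000, 1_000_000)
```

The second example shows why normalization matters: without it, a sample that was simply sequenced more deeply would appear amplified everywhere.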
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
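The per-site quantity described above is straightforward to compute from allele-specific read counts. The sketch below assumes the B-allele frequency is the fraction of reads supporting the B-allele at a heterozygous site; the read counts are hypothetical, and the segmentation step is not shown.

```python
def baf_deviation(a_reads, b_reads):
    """Absolute deviation of the B-allele frequency from 0.5 at one site."""
    baf = b_reads / (a_reads + b_reads)
    return abs(baf - 0.5)


# Hypothetical (A-allele reads, B-allele reads) at heterozygous sites.
sites = [(50, 50), (80, 20), (30, 70)]
deviations = [baf_deviation(a, b) for a, b in sites]
```

Taking the absolute value makes the measure symmetric, so a loss of either allele produces the same signal for the downstream segmentation.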
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: “driver” and “passenger” mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are “by-stander” alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the “driver” mutations from the “passenger” mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on the curated driver mutations obtained from the COSMIC database (“positive examples”) and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types (“negative examples”). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the sample size of cancer genome sequencing studies has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
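The effect of gene size on chance mutation can be illustrated with a deliberately naive model. The sketch below assumes independent mutations at a single uniform per-base rate, which is exactly the simplification that MutSigCV moves away from by using gene- and patient-specific covariates; the rate and gene lengths are illustrative values, not measurements.

```python
def prob_mutated_by_chance(coding_length_bp, rate_per_bp):
    """P(at least one mutation) under independent, uniform per-base mutation."""
    return 1.0 - (1.0 - rate_per_bp) ** coding_length_bp


# Illustrative values only, not measured rates or exact gene sizes.
background_rate = 1e-6
short_gene = prob_mutated_by_chance(1_500, background_rate)
long_gene = prob_mutated_by_chance(100_000, background_rate)
```

Even under this simple model, a very long gene is far more likely than a short one to carry at least one passenger mutation in a given sample, so raw mutation counts alone overstate its importance.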
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver genes are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
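One way to score a gene set on both properties at once is coverage minus coverage overlap, which to our understanding is the form of weight used in the Dendrix paper [66]; the exact form should be treated as an assumption here. The sketch below evaluates that weight on hypothetical data and picks the best pair by exhaustive search, standing in for the greedy and MCMC searches mentioned above.

```python
from itertools import combinations


def dendrix_style_weight(gene_set, mutated_patients):
    """Coverage minus coverage overlap for a set of genes.

    `mutated_patients` maps each gene to the set of patients carrying a mutation in it.
    """
    covered = set()
    total_mutated = 0
    for gene in gene_set:
        covered |= mutated_patients[gene]
        total_mutated += len(mutated_patients[gene])
    overlap = total_mutated - len(covered)  # extra hits in already-covered patients
    return len(covered) - overlap


# Hypothetical mutation data: gene -> patients with a mutation in that gene.
mutations = {
    "geneA": {"p1", "p2"},
    "geneB": {"p3", "p4"},
    "geneC": {"p1", "p2", "p3"},
}

best_pair = max(
    combinations(sorted(mutations), 2),
    key=lambda genes: dendrix_style_weight(genes, mutations),
)
```

The pair that covers all four patients without any co-mutation scores highest, while pairs that share mutated patients are penalized for the overlap.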
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
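Functional enrichment analysis typically asks whether a gene list overlaps a pathway more than expected by chance. The sketch below computes a plain hypergeometric tail probability for that overlap; individual tools differ in the exact statistic and multiple-testing correction they apply, so this is an illustration of the idea rather than the method of any tool named above. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(overlap, pathway_size, list_size, background_size):
    """P(overlap or more pathway genes) when drawing `list_size` genes at random."""
    upper = min(pathway_size, list_size)
    favorable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(background_size, list_size)


# Toy background of 10 genes, 5 in the pathway; all 5 listed genes hit the pathway.
p_all_hits = enrichment_pvalue(5, 5, 5, 10)
# Requiring zero or more hits is always satisfied.
p_trivial = enrichment_pvalue(0, 5, 5, 10)
```

In practice the background would be the set of genes that could have been detected in the experiment, and the resulting p-values would be corrected for the number of pathways tested.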
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here, we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
The three WES pipelines differ in their strengths and weaknesses. SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, users are advised to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
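The comparison can also be written down as a small lookup structure, which makes the shared components explicit. The sketch below only transcribes what this section states about each pipeline; the dictionary layout, key names, and helper functions are hypothetical, and tools that the pipelines may use but that are not listed in this section are not included.

```python
# Feature summary transcribed from the comparison in this section.
# The layout and key names are hypothetical, chosen for illustration.
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "tumor_only": False,
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {"GATK UnifiedGenotyper", "GATK HaplotypeCaller",
                    "SAMtools", "SNVer"},
        "tumor_only": False,
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "tumor_only": True,
    },
}

def shared(feature):
    """Tools of one feature type that all three pipelines have in common."""
    return set.intersection(*(p[feature] for p in PIPELINES.values()))

def usable_on(sample_is_tumor):
    """Pipelines whose design matches the sample type, sorted by name."""
    return sorted(
        name for name, p in PIPELINES.items()
        if sample_is_tumor or not p["tumor_only"]
    )

print(shared("aligners"))   # aligner common to all three pipelines
print(shared("callers"))    # variant caller common to all three pipelines
print(usable_on(sample_is_tumor=False))
```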
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These include methods developed to generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “dbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “SomaticSniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
structural variants, with variants greater than 15 bp rarely being identified [31]. Splitread anchors one end of a read and clusters the unanchored ends to identify the size, content, and location of structural variants [31]. When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by Sanger sequencing [31]. The unique ability of Splitread to identify large structural variants and INDELs merits its use in conjunction with other INDEL detection software in WES analysis.
The recently developed indelMINER is a compilation of tools that combines the strengths of split-read and de novo assembly approaches to determine INDELs from paired-end reads of WGS data [32]. Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs [32]. SAMtools found the fewest INDELs with 6,491, followed by Pindel with 7,239 and indelMINER with 7,365 INDELs identified. However, indelMINER's false-positive percentage (3.57%) was higher than that of SAMtools (2.65%) but lower than that of Pindel (4.53%). Conversely, indelMINER had the lowest number of false negatives with 398, compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has its own strengths and weaknesses, as demonstrated by the authors of indelMINER [32]. Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
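The call counts and false-negative counts in this comparison are enough to recompute the false-positive percentages. The short calculation below is a consistency check under two assumptions of ours: that all 7,500 simulated INDELs form the truth set, and that the false-positive percentage is expressed relative to the number of calls each tool made. The function and variable names are illustrative only.

```python
TOTAL_TRUE_INDELS = 7500  # size of the simulated truth set [32]

# (number of INDELs called, number of false negatives) as quoted in the text
REPORTED = {
    "SAMtools": (6491, 1181),
    "Pindel": (7239, 589),
    "indelMINER": (7365, 398),
}

def false_positive_percent(calls, false_negatives, total_true=TOTAL_TRUE_INDELS):
    """Percentage of a tool's calls that are not in the truth set."""
    true_positives = total_true - false_negatives
    false_positives = calls - true_positives
    return 100.0 * false_positives / calls

for tool, (calls, fn) in REPORTED.items():
    print(f"{tool}: {false_positive_percent(calls, fn):.2f}% false positives")
# SAMtools: 2.65% false positives
# Pindel: 4.53% false positives
# indelMINER: 3.57% false positives
```

Under these assumptions, SAMtools makes the fewest false calls but misses the most true INDELs, while indelMINER recovers the most true INDELs at an intermediate false-positive rate.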
Most of the recent SV detection tools rely on realigning split reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites [33] aims to solve the problem of deletions with microhomologies and deletions with microinsertions. The Sprites algorithm realigns soft-clipped reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the F-score, Sprites performed better than Pindel using real and simulated data [33].
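The idea of searching for the longest prefix or suffix of a soft-clipped segment that matches a target sequence can be illustrated with a naive exact-match scan. This is a sketch for intuition only and not the Sprites implementation: the function names and the toy sequences are hypothetical, and a real tool would tolerate mismatches and use an indexed search.

```python
def longest_prefix_match(clip, target):
    """Return (length, position) of the longest prefix of `clip` found in `target`.

    Naive exact-match scan; position is -1 when no base matches.
    """
    best_len, best_pos = 0, -1
    for length in range(1, len(clip) + 1):
        pos = target.find(clip[:length])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, pos
    return best_len, best_pos

def longest_suffix_match(clip, target):
    """Same idea for suffixes, implemented by reversing both strings."""
    length, rev_pos = longest_prefix_match(clip[::-1], target[::-1])
    if length == 0:
        return 0, -1
    # Convert the position in the reversed target to forward coordinates.
    return length, len(target) - rev_pos - length

target = "ACGTTGCAAGGCTTAACCGG"   # hypothetical target window
print(longest_prefix_match("GGCTTAAGTT", target))  # (7, 9)
print(longest_suffix_match("TTTCCGG", target))     # (4, 16)
```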
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods. Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR [34] and MuTect [35], which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim to rapidly annotate millions of variants with ease and remains one of the popular variant annotation methods to date [34]. ANNOVAR can use gene-, region-, or filter-based annotation to access over 20 public databases for variant annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff [36] and SnpSift [37]. SnpEff can perform annotation for multiple variants, and SnpSift allows rapid detection of significant variants from the VCF files [37]. The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities [38]. VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration. During the annotation process, many resources and databases can be used as filtering criteria for distinguishing novel variants from common polymorphisms. These databases score a variant by its minor allele frequency (MAF) within a specific population or study. The need for filtration of variants based on this number depends on the purpose of the study. For example, Mendelian studies would be interested in including common SNVs, while cancer studies usually focus on rare variants found in less than 1% of the population. The NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms [39]. dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information [39].
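A MAF-based filter of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming each annotated variant is held as a dictionary with a hypothetical "maf" field (None when the variant is absent from the population database); the 1% default threshold follows the cancer-study example above, and the record identifiers are made up.

```python
def filter_by_maf(variants, max_maf=0.01, keep_unknown=True):
    """Keep variants whose minor allele frequency is below `max_maf`.

    variants: iterable of dicts with a "maf" key (hypothetical field name);
    "maf" is None when the variant is not present in the database.
    """
    kept = []
    for v in variants:
        maf = v.get("maf")
        if maf is None:
            if keep_unknown:
                kept.append(v)  # never observed in the database: possibly novel
        elif maf < max_maf:
            kept.append(v)      # rare in the reference population
    return kept

# Made-up records for illustration.
variants = [
    {"id": "var1", "maf": 0.25},    # common polymorphism
    {"id": "var2", "maf": 0.004},   # rare variant
    {"id": "var3", "maf": None},    # not in the database
]
rare = filter_by_maf(variants)
print([v["id"] for v in rare])  # ['var2', 'var3']
```

Whether unobserved variants are kept, and which threshold is appropriate, depends on the purpose of the study, which is why both are exposed as parameters here.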
The Leiden Open Variation Database (LOVD), developed in 2005, links its database to several other repositories so that the user can make comparisons and gain further information [40]. One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project, which uses statistics from the sequencing of more than 1,000 “healthy” people of all ethnicities [41]. This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) [42]. This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies [3]. The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data [3].
2.7. Functional Predictors of Mutation. Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD [34].
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments [43]. PolyPhen-2 uses a pipeline involving eight sequence-based methods and three structure-based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious [44]. The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutation's functional impact [45]. When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations [49].
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely unanimously agree on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
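Because the predictors rarely agree unanimously, it can help to summarize their calls per variant before interpreting them. The sketch below is a simple vote count, assuming each predictor's output has already been reduced to a deleterious or tolerated call; the variant names and calls are hypothetical and are not real predictor output, and this is not how MetaSVM or MetaLR combine scores.

```python
def consensus(calls):
    """Summarize boolean deleterious calls from several predictors.

    calls: dict mapping predictor name -> True (deleterious) or False (tolerated).
    """
    votes = sum(calls.values())
    return {
        "votes": votes,
        "n_predictors": len(calls),
        "unanimous": votes == len(calls),
        "majority": votes * 2 > len(calls),
    }

# Hypothetical calls for two variants.
variant_calls = {
    "variantA": {"SIFT": True, "PolyPhen-2": True, "LRT": True},
    "variantB": {"SIFT": True, "PolyPhen-2": False, "LRT": False},
}
for name, calls in variant_calls.items():
    print(name, consensus(calls))
```

Requiring unanimity would keep only variantA here, whereas accepting any single call would keep both; the appropriate rule depends on the focus of the study.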
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and hence predicted to be somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
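Phred scaling maps a probability p to Q = -10 log10(p), so that larger scores mean smaller probabilities. The sketch below assumes, for illustration, that the somatic score is the Phred transform of the probability that the tumor and normal genotypes are not different; the function names and the cap of 255 are our own choices and are not taken from the SomaticSniper source.

```python
import math

def phred_scale(p_error, cap=255):
    """Convert a probability to a Phred-scaled integer: Q = -10 * log10(p).

    The cap (an assumption of this sketch) handles p_error == 0.
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability")
    if p_error == 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(p_error))))

def somatic_score(p_genotypes_differ):
    """Phred-scale the chance that tumor and normal genotypes are NOT different."""
    return phred_scale(1.0 - p_genotypes_differ)

print(somatic_score(0.999))  # 30: strong evidence for a somatic change
print(somatic_score(0.5))    # 3: weak evidence
```

On this scale a score of 30 corresponds to a 1 in 1,000 chance that the two genotypes are actually the same.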
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine whether a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples, as well as the dbSNP database, is used for comparison to remove common polymorphisms. In this way, somatic mutations are not only identified but also reduced to a more probable set of candidates. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for somatic variant detection in cancer samples that lack corresponding normal samples.
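Precision and recall figures of the kind quoted above can be computed by comparing a call set against a reference set. The sketch below is a generic illustration, assuming variants are represented as hashable keys such as (chromosome, position, alt) tuples; it is not the evaluation code used in [52], and the example variants are made up.

```python
def precision_recall(called, truth):
    """Return (precision, recall) of a call set against a reference set.

    called: iterable of variant keys reported by the unpaired caller.
    truth:  iterable of variant keys from the paired-tissue reference calls.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up variants for illustration only.
    unpaired_calls = {("chr1", 100, "T"), ("chr2", 200, "G"), ("chr3", 300, "A")}
    paired_calls = {("chr2", 200, "G"), ("chr3", 300, "A"),
                    ("chr4", 400, "C"), ("chr5", 500, "T")}
    print(precision_recall(unpaired_calls, paired_calls))
```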
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
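The depth-ratio step described above can be sketched as follows. This is a simplified illustration, not VarScan2's implementation: it assumes that per-region read depths are already available, that normalization is a simple scaling by the total amount of data in each sample, and that a small pseudocount is used to avoid division by zero. All names and parameters are hypothetical.

```python
import math


def log2_ratios(tumor_depth, normal_depth, tumor_total, normal_total,
                pseudocount=1.0):
    """Return log2(tumor / normal) per region after scaling for input amount.

    tumor_depth, normal_depth: per-region read depths (same length, same order).
    tumor_total, normal_total: total amount of data (e.g., mapped bases) per sample.
    pseudocount:               assumed constant that guards against zero depth.
    """
    # Bring the tumor depths onto the scale of the normal sample.
    scale = normal_total / tumor_total
    ratios = []
    for t, n in zip(tumor_depth, normal_depth):
        ratios.append(math.log2((t * scale + pseudocount) / (n + pseudocount)))
    return ratios


if __name__ == "__main__":
    # Three regions: unchanged, roughly doubled, and roughly halved in the tumor.
    print(log2_ratios([100, 200, 50], [100, 100, 100], 1000, 1000))
```

Positive values suggest a gain and negative values a loss; in practice these per-region values would then be passed to a segmentation step such as CBS.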
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between the A- and B-alleles in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES data for the detection of recurrent AI events without matched normal samples.
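The per-site quantity described above, the absolute deviation of the B-allele frequency from 0.5, can be written in a few lines. The sketch assumes that read counts for the A (reference) and B (alternate) alleles are available at each heterozygous site; the function name is hypothetical and the code is not taken from ExomeAI.

```python
def baf_deviation(a_reads: int, b_reads: int) -> float:
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no reads at this site")
    b_allele_frequency = b_reads / total
    return abs(b_allele_frequency - 0.5)


if __name__ == "__main__":
    # A balanced site versus a site skewed toward the B-allele.
    print(baf_deviation(50, 50), baf_deviation(20, 80))
```

A value near 0 is consistent with a balanced 1:1 ratio, while values approaching 0.5 indicate that one allele dominates.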
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: “driver” and “passenger” mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are “by-stander” alterations that happen to occur in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the “driver” mutations from the “passenger” mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including, but not limited to, CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses a random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on curated driver mutations obtained from the COSMIC database (“positive examples”) and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types (“negative examples”). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of sequenced cancer genomes has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they are mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum, as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of these available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
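The gene-size effect mentioned above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a single uniform background mutation rate per base and independence between bases, which is far simpler than the covariate-based model of MutSigCV; the gene lengths, mutation rate, and cohort size are made-up illustrative values, and the function names are hypothetical.

```python
def prob_mutated_by_chance(coding_length_bp: int, rate_per_base: float) -> float:
    """Probability of at least one background mutation in a gene of a given length."""
    return 1.0 - (1.0 - rate_per_base) ** coding_length_bp


def expected_mutated_patients(coding_length_bp: int, rate_per_base: float,
                              n_patients: int) -> float:
    """Expected number of patients carrying at least one chance mutation in the gene."""
    return n_patients * prob_mutated_by_chance(coding_length_bp, rate_per_base)


if __name__ == "__main__":
    # Illustrative inputs only: a short gene and a very long gene in 500 patients.
    rate = 2e-6
    print(expected_mutated_patients(1_500, rate, 500))
    print(expected_mutated_patients(100_000, rate, 500))
```

Even with no selective advantage, the long gene is expected to appear mutated in many more patients than the short one, which is why a recurrence count alone can be misleading.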
De novo Driver Exclusivity (Dendrix) is a computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity in the somatic data. The high coverage property assumes that most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver genes are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
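The trade-off between coverage and exclusivity can be captured in a single score. The sketch below follows our reading of the weight function in [66], in which the number of patients covered by a gene set is penalized by the coverage overlap (patients counted more than once because they carry mutations in several genes of the set). The data structure and function name are assumptions made for illustration; this is not the Dendrix code, and the greedy and MCMC search procedures are not shown.

```python
def gene_set_weight(gene_set, mutations):
    """Score a gene set by coverage minus coverage overlap.

    gene_set:  iterable of gene names.
    mutations: dict mapping gene name -> set of patient IDs mutated in that gene.
    """
    covered = set()          # patients with a mutation in at least one gene
    total_per_gene = 0       # sum of per-gene patient counts
    for gene in gene_set:
        patients = mutations.get(gene, set())
        covered |= patients
        total_per_gene += len(patients)
    overlap = total_per_gene - len(covered)   # zero when genes are exclusive
    return len(covered) - overlap


if __name__ == "__main__":
    # Toy data: genes A and B are mutually exclusive, A and C largely co-occur.
    toy = {"A": {1, 2, 3}, "B": {4, 5}, "C": {1, 2, 3, 4}}
    print(gene_set_weight(["A", "B"], toy), gene_set_weight(["A", "C"], toy))
```

A set whose genes cover many patients without co-occurring scores highest, which matches the two properties described in the text.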
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is one of the largest protein-protein interaction databases for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
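Functional enrichment analysis of a gene list is commonly framed as an over-representation test. The sketch below shows the generic one-sided hypergeometric test using only the Python standard library; the tools named above may use modified variants of this test, so it should be read as an illustration of the idea rather than a description of any specific tool. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap: int, list_size: int,
                       pathway_size: int, background_size: int) -> float:
    """One-sided hypergeometric p-value, P(X >= overlap).

    overlap:         genes in the user list that belong to the pathway.
    list_size:       number of genes in the user list.
    pathway_size:    number of pathway genes in the background.
    background_size: total number of genes in the background.
    """
    upper = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy example: 8 of 50 listed genes fall in a 200-gene pathway (20,000 background).
    print(enrichment_p_value(8, 50, 200, 20_000))
```

When many pathways are tested at once, the resulting p-values would normally be corrected for multiple testing.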
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate the computational tools and methods described in the previous sections into a single analysis workflow. Here, we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
Comparing the strengths and weaknesses of these three WES pipelines, SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods include tools developed to generate VCF files from raw sequencing data, as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigo, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “dbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “SomaticSniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods [45]. Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutations [46]. FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain [46].
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 Genomes populations for predicting deleterious variants [47]. MetaSVM and MetaLR are based on an ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores [47].
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict the functionality of mutations [48]. The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies [48]. The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple annotations into a single score, contrasting variants that have survived natural selection with simulated mutations [49].
While all of these methods predict the functionality of a mutation, they vary slightly in their methodological and biological assumptions. Dong et al. recently tested the performance of these prediction algorithms on known datasets [47]. They pointed out that these methods rarely agree unanimously on whether a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
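The degree of agreement between predictors can be quantified directly once each tool's calls have been collected. The following is a minimal sketch, not taken from any of the cited studies: the function name `unanimous_fraction` and the toy calls are hypothetical, and each predictor's output is assumed to have already been reduced to a deleterious/not-deleterious flag per variant.

```python
from typing import Dict, List


def unanimous_fraction(calls: Dict[str, List[bool]]) -> float:
    """Among variants flagged deleterious by at least one predictor,
    return the fraction flagged deleterious by every predictor."""
    columns = list(calls.values())
    if not columns:
        raise ValueError("at least one predictor is required")
    n_variants = len(columns[0])
    if any(len(col) != n_variants for col in columns):
        raise ValueError("all predictors must score the same variants")

    flagged_by_any = 0
    flagged_by_all = 0
    for per_variant in zip(*columns):  # one tuple of flags per variant
        if any(per_variant):
            flagged_by_any += 1
            if all(per_variant):
                flagged_by_all += 1
    return flagged_by_all / flagged_by_any if flagged_by_any else 0.0


# Hypothetical calls for five variants (illustrative values only).
toy_calls = {
    "SIFT":       [True, True,  False, True,  False],
    "PolyPhen-2": [True, False, True,  True,  False],
    "LRT":        [True, False, False, False, False],
}
print(unanimous_fraction(toy_calls))
```

In this toy input, four variants are flagged by at least one tool but only one is flagged by all three, which mirrors in miniature why relying on a single predictor can miss or overcall deleterious variants.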
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here, we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations. After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper [50], MuTect [35], VarSim [51], and SomVarIUS [52].
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample and hence predicted as somatic mutations [50]. SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score, which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
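For readers unfamiliar with Phred scaling, the conversion can be sketched as follows. This is an illustration of the general Phred convention, Q = -10 log10(p), and not SomaticSniper's source code. It assumes that the scaled quantity is the probability that the call is wrong (here, that the tumor and normal genotypes are in fact the same), so that a higher score means stronger evidence for a somatic change. The cap of 255 and both function names are assumptions made for this example.

```python
import math


def phred_scale(p_error: float, cap: float = 255.0) -> float:
    """Convert an error probability to a Phred-scaled score.

    p_error: probability that the call is wrong (assumed meaning).
    cap: upper bound used when p_error is 0 or vanishingly small
         (an assumed convention for this sketch).
    """
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    if p_error == 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


def phred_to_probability(score: float) -> float:
    """Invert the Phred scaling: p = 10^(-Q/10)."""
    return 10.0 ** (-score / 10.0)


# A 1-in-1000 chance of error corresponds to a score of about 30.
print(round(phred_scale(0.001), 6))
print(phred_to_probability(20))
```

Because the scale is logarithmic, every 10-point increase in the score corresponds to a tenfold drop in the assumed error probability.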
Another popular somatic mutation identification tool is MuTect [35], developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, somatic mutations are not only identified but also reduced to a more probable set of candidates. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim [51] and SomVarIUS [52] do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim [51] uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample [52]. In brief, SomVarIUS consists of three steps for somatic variant detection: it first prioritizes potential variant sites, then estimates the probability of a sequencing error, and finally estimates the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall when compared with paired-tissue somatic variant calls in real tumor samples [52]. Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
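Precision and recall figures of the kind reported for unpaired callers are obtained by comparing a call set against a reference call set, such as calls made with a matched normal sample. The sketch below shows only the arithmetic and is not taken from SomVarIUS; the function name, the variant tuple layout (chromosome, position, reference allele, alternate allele), and the example variants are hypothetical.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); assumed layout


def precision_recall(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, recall) of `called` against the reference set `truth`."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical call sets (illustrative coordinates only).
unpaired_calls = {
    ("chr1", 1000, "A", "T"),
    ("chr1", 2000, "G", "C"),
    ("chr2", 3000, "C", "T"),
    ("chr3", 4000, "T", "G"),
}
paired_reference = {
    ("chr1", 1000, "A", "T"),
    ("chr1", 2000, "G", "C"),
    ("chr5", 5000, "G", "A"),
}
print(precision_recall(unpaired_calls, paired_reference))
```

Here two of the four unpaired calls are confirmed (precision 0.5) and two of the three reference variants are recovered (recall about 0.67).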
3.2. Computational Tools for Estimating Copy Number Alteration. One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples, such as CNV-seq [53], SegSeq [54], ADTEx [55], CONTRA [56], EXCAVATOR [57], ExomeCNV [58], Control-FREEC (control-FREE Copy number caller) [59], and CNVseeqer [60]. For example, VarScan2 [61] is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
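The depth-ratio calculation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not VarScan2's implementation: normalization is assumed to be a simple rescaling of tumor depth by the ratio of total input data in the two samples, both depths are assumed to be positive, and an event covering exactly 25% of an arm is treated as focal. All function and parameter names are hypothetical.

```python
import math


def log2_depth_ratio(tumor_depth: float, normal_depth: float,
                     tumor_total: float, normal_total: float) -> float:
    """log2 of normalized tumor depth over normal depth for one region.

    tumor_total / normal_total: overall amount of input data per sample
    (for example, total mapped bases), used to put depths on one scale.
    """
    if min(tumor_depth, normal_depth, tumor_total, normal_total) <= 0:
        raise ValueError("depths and totals must be positive in this sketch")
    scaled_tumor = tumor_depth * (normal_total / tumor_total)
    return math.log2(scaled_tumor / normal_depth)


def classify_event(segment_length: int, arm_length: int) -> str:
    """Label a segment by the fraction of the chromosome arm it covers."""
    if arm_length <= 0:
        raise ValueError("arm_length must be positive")
    return "large-scale" if segment_length / arm_length > 0.25 else "focal"


# Equal input data, tumor depth doubled: log2 ratio of 1 (a gain).
print(log2_depth_ratio(200, 100, 1e6, 1e6))
# Tumor sequenced twice as deeply overall, same raw depth: ratio of -1 (a loss).
print(log2_depth_ratio(100, 100, 2e6, 1e6))
print(classify_event(30_000_000, 100_000_000))
```

A log2 ratio near zero indicates no change relative to the normal sample; segmentation then groups adjacent regions with similar ratios before events are classified.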
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, similar to VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
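The per-site quantity used for allelic imbalance detection is simple to compute. The following sketch illustrates the absolute deviation of the B-allele frequency from 0.5 at heterozygous sites; it is not ExomeAI's code, the function names and read counts are hypothetical, and the segmentation and normal-database filtering steps are omitted.

```python
from typing import Iterable, Tuple


def baf_deviation(a_reads: int, b_reads: int) -> float:
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("site has no covering reads")
    return abs(b_reads / total - 0.5)


def mean_baf_deviation(sites: Iterable[Tuple[int, int]]) -> float:
    """Average deviation over (a_reads, b_reads) pairs, e.g. for one arm."""
    deviations = [baf_deviation(a, b) for a, b in sites]
    if not deviations:
        raise ValueError("no sites supplied")
    return sum(deviations) / len(deviations)


# Hypothetical heterozygous sites as (A-allele reads, B-allele reads).
balanced_arm = [(50, 50), (48, 52), (51, 49)]
imbalanced_arm = [(20, 80), (25, 75), (82, 18)]
print(mean_baf_deviation(balanced_arm))
print(mean_baf_deviation(imbalanced_arm))
```

Using the absolute deviation means that imbalance toward either allele contributes in the same direction, so sites such as (20, 80) and (82, 18) reinforce rather than cancel each other.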
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to be altered in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of cancer genomes sequenced has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they would be mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of these available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
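The effect of gene size on chance mutation can be illustrated with a deliberately simplified model. The sketch below assumes a single uniform per-base mutation probability and independent sites, which is exactly the kind of simplification that MutSigCV's covariates are meant to improve on; it is not the MutSigCV model. The function name, the mutation rate, and both coding lengths are hypothetical round numbers chosen for illustration.

```python
def p_mutated_by_chance(coding_length_bp: int, per_base_rate: float) -> float:
    """Probability that a gene carries at least one mutation in a sample,
    assuming independent sites with a uniform per-base mutation probability."""
    if coding_length_bp < 0:
        raise ValueError("coding_length_bp must be non-negative")
    if not 0.0 <= per_base_rate <= 1.0:
        raise ValueError("per_base_rate must be a probability")
    return 1.0 - (1.0 - per_base_rate) ** coding_length_bp


RATE = 1e-6                  # hypothetical per-base mutation probability
TYPICAL_GENE_BP = 1_500      # hypothetical coding length of a small gene
VERY_LARGE_GENE_BP = 100_000  # hypothetical coding length of a very large gene

small = p_mutated_by_chance(TYPICAL_GENE_BP, RATE)
large = p_mutated_by_chance(VERY_LARGE_GENE_BP, RATE)
print(f"small gene: {small:.4f}, very large gene: {large:.4f}")
```

Under these assumed numbers the very large gene is far more likely to be hit in any one sample, so across a large cohort it accumulates many mutations without any selective advantage being involved.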
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that the genes in the set are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to identify sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
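The trade-off between coverage and exclusivity can be captured in a single score. The sketch below uses the weight W(M) = 2|Γ(M)| − Σ|Γ(g)|, where Γ(g) is the set of patients with a mutation in gene g and Γ(M) is the union over genes in the set M; this rewards patients covered and penalizes patients with mutations in more than one gene of the set. For brevity, the search here is an exhaustive enumeration over small gene sets, which is a stand-in for, and not an implementation of, the greedy and MCMC algorithms. Function names and the toy mutation data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def coverage_exclusivity_weight(gene_set: Iterable[str],
                                mutated_in: Dict[str, Set[str]]) -> int:
    """W(M) = 2 * |patients covered by M| - sum of per-gene patient counts."""
    genes = list(gene_set)
    covered: Set[str] = set()
    total_mutations = 0
    for gene in genes:
        covered |= mutated_in[gene]
        total_mutations += len(mutated_in[gene])
    return 2 * len(covered) - total_mutations


def best_gene_set(mutated_in: Dict[str, Set[str]], k: int) -> Tuple[str, ...]:
    """Exhaustively find the size-k gene set with the highest weight.
    Feasible only for small inputs; shown purely for illustration."""
    return max(combinations(sorted(mutated_in), k),
               key=lambda m: coverage_exclusivity_weight(m, mutated_in))


# Hypothetical gene -> mutated patients map.
toy = {
    "A": {"p1", "p2", "p3"},
    "B": {"p4", "p5"},
    "C": {"p1", "p2", "p4"},
    "D": {"p6"},
}
print(best_gene_set(toy, 2))
print(coverage_exclusivity_weight(("A", "C"), toy))
```

In the toy data, genes A and B cover five patients with no overlap and score highest, whereas A and C share two patients and are penalized accordingly.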
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions between user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and offer the flexibility to explore the interactions between user-defined gene lists by expanding the networks.
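Functional enrichment analysis typically asks whether a user-defined gene list overlaps a pathway more than expected by chance. A minimal sketch using the plain hypergeometric upper tail is given below. Individual tools may use related but different statistics and multiple-testing corrections, so this should not be read as the calculation performed by DAVID, STRING, or BEReX. The function and parameter names and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe: int, n_pathway: int,
                       n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, n_pathway of which are in the pathway."""
    if not (0 <= n_pathway <= n_universe and 0 <= n_selected <= n_universe):
        raise ValueError("pathway and selection sizes must fit in the universe")
    if n_overlap < 0:
        raise ValueError("n_overlap must be non-negative")
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 genes, a 100-gene pathway,
# 200 mutated genes of which 8 fall in the pathway.
print(enrichment_p_value(20_000, 100, 200, 8))
```

With these assumed numbers only about one pathway gene would be expected in the list by chance, so an overlap of eight yields a very small p-value.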
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying a list of variants against one of these databases allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections into a single analysis workflow. Here, we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses using BWA as an alignment tool and variant calling by GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options as compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for users to consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods were developed to generate VCF files from raw sequencing data, as well as to perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigo, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., "XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., "DbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., "Somaticsniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, "Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM)," Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, "De novo discovery of mutated driver pathways in cancer," Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
16 International Journal of Genomics
[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013
[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000
[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009
[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015
[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014
[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011
[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014
[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014
[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012
[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014
[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014
[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012
[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015
[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015
[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016
[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016
a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizing them based on the amount of input data for each sample [61]. Copy number alteration is inferred from the $\log_2$ of the ratio of tumor depth to normal depth for each region [61]. Lastly, the circular binary segmentation (CBS) algorithm [62] is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs can be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data [63].
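The depth-ratio step described above can be illustrated with a small sketch. It is not VarScan2's implementation: the function names, the library-size scaling by total input data, and the handling of zero-coverage regions are assumptions made for illustration, and the CBS segmentation step is not shown.

```python
import math

def log2_depth_ratios(tumor_depths, normal_depths, tumor_total, normal_total):
    """Return log2(tumor/normal) per region after library-size scaling.

    Regions with no coverage in either sample yield None, since the
    ratio is undefined there (an assumption of this sketch).
    """
    # Scale tumor depth so both samples reflect the same amount of input data.
    scale = normal_total / tumor_total
    ratios = []
    for tumor, normal in zip(tumor_depths, normal_depths):
        if tumor <= 0 or normal <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(tumor * scale / normal))
    return ratios

def classify_scna(segment_length, arm_length, cutoff=0.25):
    """Label a segment by the fraction of the chromosome arm it spans."""
    # The text gives >25% and <25%; treating exactly 25% as focal is an assumption.
    return "large-scale" if segment_length / arm_length > cutoff else "focal"

# Toy depths: a gain, a neutral region, and a loss at equal library sizes.
print(log2_depth_ratios([200, 100, 50], [100, 100, 100], 1000, 1000))
print(classify_scna(30, 100), classify_scna(10, 100))
```

A positive value suggests a gain and a negative value a loss relative to the normal sample; in practice the per-region values would be segmented before any event is called.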
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data [64]. Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1:1 ratio between the A- and B-alleles in multiple tumor samples without a normal comparison. The absolute deviation of the B-allele frequency from 0.5 is calculated and, as in VarScan2, the CBS algorithm is applied to each chromosomal arm [62]. To reduce the number of false positives, a database of 500 (and counting) normal samples was created to filter out known AIs. ExomeAI is thus a novel tool for detecting recurrent AI events in WES data without matched normal samples.
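The per-site quantity that ExomeAI segments can be written down directly. The sketch below only computes the absolute deviation of the B-allele frequency from 0.5 at heterozygous sites; the function name and the read-count inputs are hypothetical, and the CBS segmentation and normal-sample filtering described above are not implemented.

```python
def baf_deviation(a_allele_reads, b_allele_reads):
    """Absolute deviation of the B-allele frequency from the expected 0.5."""
    total = a_allele_reads + b_allele_reads
    if total == 0:
        raise ValueError("site has no reads")
    return abs(b_allele_reads / total - 0.5)

# Hypothetical heterozygous sites as (A-allele reads, B-allele reads).
sites = [(50, 50), (20, 80), (70, 30)]
deviations = [baf_deviation(a, b) for a, b in sites]
print(deviations)
```

A balanced site gives a deviation of 0, while a site where one allele dominates approaches 0.5.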
A systematic evaluation of somatic copy number estimation tools for WES data has recently been published [63]. In this study, six computational tools for CNA detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance, with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
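For readers unfamiliar with the two metrics, the sketch below shows how precision and sensitivity are computed once calls and a reference set are reduced to comparable identifiers. This is a simplification: the region identifiers are hypothetical, and the evaluation in [63] may match calls to the SNP-array reference differently (for example, by segment overlap).

```python
def precision_sensitivity(called, reference):
    """Compare a tool's calls with a reference set of events.

    precision   = true positives / all calls made by the tool
    sensitivity = true positives / all events in the reference
    """
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    sensitivity = true_positives / len(reference) if reference else 0.0
    return precision, sensitivity

# Hypothetical region identifiers, for illustration only.
tool_calls = {"region1", "region2", "region3", "region4"}
array_reference = {"region2", "region3", "region5"}
print(precision_sensitivity(tool_calls, array_reference))
```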
3.3. Computational Tools for Predicting Drivers in Cancer Exomes. Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes: "driver" and "passenger" mutations. Driver mutations are the key mutations that drive the development of cancer and provide a survival advantage, whereas passenger mutations are "by-stander" alterations that happen to occur in the primary cells but do not provide a survival advantage. As cancer exomes tend to have high mutational burdens, distinguishing the "driver" mutations from the "passenger" mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations, including but not limited to CHASM [65], Dendrix [66], and MutSigCV [67].
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses a random forest as the machine learning approach to distinguish between driver and passenger mutations in cancer [65]. CHASM was trained on curated driver mutations obtained from the COSMIC database ("positive examples") and synthetic passenger mutations generated according to the background base substitution frequencies observed in specific tumor types ("negative examples"). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV, developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations [67]. As the number of sequenced cancer genomes has increased, implausible genes (such as TTN) have been falsely reported as being related to cancer, when in fact their large size simply increases the probability that they are mutated by chance [67]. MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of these data into one tool, MutSigCV has become a standard tool for driver mutation identification in cancer studies.
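The gene-size effect can be made concrete with a back-of-the-envelope calculation. Assuming, purely for illustration, a uniform and independent per-base background mutation rate $\mu$, the probability that a gene of length $L$ carries at least one mutation by chance is $1 - (1 - \mu)^L$. This is not MutSigCV's model (which replaces the uniform rate with patient- and gene-specific covariates); the rate and gene lengths below are hypothetical values.

```python
def prob_mutated_by_chance(gene_length_bp, rate_per_bp):
    """P(at least one background mutation) under a uniform per-base rate."""
    return 1.0 - (1.0 - rate_per_bp) ** gene_length_bp

# Hypothetical values: same background rate, two very different gene sizes.
rate = 1e-6
short_gene = prob_mutated_by_chance(1_000, rate)
long_gene = prob_mutated_by_chance(100_000, rate)
print(short_gene, long_gene)
```

Under these assumptions the longer gene is far more likely to appear mutated in any given sample, which is why recurrence counts alone can promote very large genes to apparent significance.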
De novo Driver Exclusivity (Dendrix) is a computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data [66]. The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity in the somatic mutation data. The high coverage property assumes that most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that the genes in the set are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to find sets of genes that exhibit both criteria. When Dendrix was applied to TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and whose mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
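The trade-off between coverage and exclusivity can be sketched as a single score. The weight used below (patients covered minus the number of redundant, overlapping mutations) follows our reading of the formulation in [66]; the simple one-gene-at-a-time greedy search is a simplification rather than the published procedure, and the gene names and patient identifiers are hypothetical.

```python
def coverage(gene_set, mutations):
    """Patients carrying a mutation in at least one gene of the set."""
    covered = set()
    for gene in gene_set:
        covered |= mutations.get(gene, set())
    return covered

def weight(gene_set, mutations):
    """Coverage minus overlap: high when many patients are covered exclusively."""
    n_covered = len(coverage(gene_set, mutations))
    n_mutations = sum(len(mutations.get(gene, set())) for gene in gene_set)
    overlap = n_mutations - n_covered  # mutations beyond the first per patient
    return n_covered - overlap

def greedy_gene_set(mutations, k):
    """Grow a gene set one gene at a time, keeping the best-scoring addition."""
    chosen = []
    candidates = sorted(mutations)  # sorted for deterministic tie-breaking
    while len(chosen) < min(k, len(candidates)):
        remaining = [g for g in candidates if g not in chosen]
        best = max(remaining, key=lambda g: weight(chosen + [g], mutations))
        chosen.append(best)
    return chosen

# Hypothetical gene -> mutated-patient mapping.
toy = {"A": {1, 2, 3}, "B": {4, 5}, "C": {1, 2, 3, 4}, "D": {6}}
print(weight(["A", "B"], toy), weight(["A", "C"], toy))
print(greedy_gene_set(toy, 2))
```

In the toy data, genes A and C are mutated in the same patients and therefore score poorly together, whereas A and B cover five patients with no overlap.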
3.4. Methods for Pathway Analysis. After candidate somatic mutations have been identified, one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG [68], DAVID [69], STRING [70], BEReX [71], DAPPLE [72], and SNPsea [73].
KEGG is one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user-defined gene lists. STRING is the largest protein-protein interaction database for querying and searching for interactions among the genes in user-defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and to explore the interactions between user-defined gene lists by expanding the networks.
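Functional enrichment analysis is often framed as an over-representation test. The sketch below computes a one-sided hypergeometric tail probability, which is one common formulation and not necessarily the exact statistic used by DAVID, STRING, or BEReX; the function and parameter names are assumptions for illustration.

```python
from math import comb

def enrichment_pvalue(universe, pathway, hits, overlap):
    """P(X >= overlap) when `hits` genes are drawn from `universe` genes,
    of which `pathway` belong to the pathway of interest."""
    max_overlap = min(pathway, hits)
    total_draws = comb(universe, hits)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, hits - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total_draws

# Toy example: 10 genes in total, 5 in the pathway, 5 mutated genes,
# all 5 of which fall in the pathway.
print(enrichment_pvalue(10, 5, 5, 5))
```

A small p-value indicates that the mutated gene list overlaps the pathway more than expected by chance; in practice the values would also be corrected for the number of pathways tested.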
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature-reported protein-protein interactions to identify significant physical connectivity among the genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments. The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to its use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as the alignment tool and calling variants with GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, resulting in a filtered VCF after implementation of ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
Both the SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
To compare the strengths and weaknesses of these three WES pipelines: SeqMule allows the use of different alignment algorithms in its pipeline, whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR; Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as the primary variant effect prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be run on tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, users should consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
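The comparison above can be captured in a small data structure, which makes the shared and pipeline-specific components easy to query. The tool lists are taken from the text of this section only; the dictionary layout and the helper name `shared_tools` are illustrative choices, and an empty set means that the text does not detail that component, not that the pipeline lacks it.

```python
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "annotators": set(),  # not detailed in this section
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        # UnifiedGenotyper and HaplotypeCaller are grouped under GATK here.
        "callers": {"GATK", "SAMtools", "SNVer"},
        "annotators": {"ANNOVAR", "VEP"},
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "annotators": {"ANNOVAR", "SIFT", "PolyPhen-2"},
    },
}

def shared_tools(component, pipelines=PIPELINES):
    """Tools of one component type that every listed pipeline uses."""
    tool_sets = [spec[component] for spec in pipelines.values()]
    return set.intersection(*tool_sets)

print(shared_tools("aligners"))
print(shared_tools("callers"))
```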
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider tools tailored for specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods include tools that generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, "Sequencing technologies – the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, "Coming of age: ten years of next-generation sequencing technologies," Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel, et al., "Analysis of protein-coding genetic variation in 60,706 humans," Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks, et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin, et al., "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl, et al., "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker, et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, "Fast and accurate long-read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, "The GEM mapper: fast, accurate and versatile alignment by filtration," Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, "Fast and SNP-tolerant detection of complex variants and splicing in short reads," Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet, et al., "Personalized copy number and segmental duplication maps using next-generation sequencing," Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li, et al., "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, "Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads," Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, "YOABS: yet other aligner of biological sequences – an efficient linearly scaling nucleotide aligner," Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, "Next-generation DNA sequencing," Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, "Repetitive DNA and next-generation sequencing: computational challenges and solutions," Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian, et al., "FastUniq: a fast de novo duplicates removal tool for paired short reads," PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer, et al., "A survey of tools for variant analysis of next-generation genome sequencing data," Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama, et al., "A practical method to detect SNVs and indels from whole genome and exome sequencing data," Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson, et al., "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications," Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, "Haplotype-based variant detection from short-read sequencing," https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr, et al., "A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing," BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal, et al., "XomAnnotate: analysis of heterogeneous and complex exome – a step towards translational medicine," PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads," Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O'Roak, et al., "Detection of structural variants and indels within exome data," Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, "Identification of indels in next-generation sequencing data," BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo, et al., "Sprites: detection of deletions from sequencing data by re-aligning split reads," Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data," Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter, et al., "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples," Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang, et al., "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3," Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon, et al., "Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift," Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen, et al., "VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment," Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov, et al., "dbSNP: the NCBI database of genetic variation," Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: easy creation of a locus-specific sequence variation database using an 'LSDB-in-a-box' approach," Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler, et al., "A map of human genome variation from population-scale sequencing," Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran, et al., "COSMIC: exploring the world's knowledge of somatic mutations in human cancer," Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, "Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm," Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin, et al., "A method and server for predicting damaging missense mutations," Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, "Identification of deleterious mutations within three human genomes," Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper, et al., "Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models," Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian, et al., "Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies," Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, "Identifying Mendelian disease genes with the variant effect scoring tool," BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, "A general framework for estimating the relative pathogenicity of human genetic variants," Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen, et al., "SomaticSniper: identification of somatic point mutations in whole genome sequencing data," Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li, et al., "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications," Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, "SomVarIUS: somatic variant identification from unpaired tissue samples," Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing," BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe, et al., "High-resolution mapping of copy-number alterations with massively parallel sequencing," Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter, et al., "Inferring copy number and genotype in tumour exome data," BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe, et al., "CONTRA: copy number analysis for targeted resequencing," Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola, et al., "EXCAVATOR: detecting copy number variants from whole-exome sequencing data," Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst, et al., "Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV," Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley, et al., "Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization," Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie, et al., "Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas," Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson, et al., "VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing," Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, "Parent-specific copy number in paired tumor-normal studies using circular binary segmentation," Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim, et al., "Evaluation of somatic copy number estimation tools for whole-exome sequencing data," Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, "ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data," Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, "Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM)," Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, "De novo discovery of mutated driver pathways in cancer," Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak, et al., "Mutational heterogeneity in cancer and the search for new cancer-associated genes," Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, "KEGG: Kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder, et al., "STRING v10: protein-protein interaction networks, integrated over the tree of life," Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, "BEReX: biomedical entity-relationship eXplorer," Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri, et al., "Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology," PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci," Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley, et al., "ClinVar: public archive of relationships among sequence variation and human phenotype," Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert, et al., "Pharmacogenomics knowledge for personalized medicine," Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou, et al., "DrugBank 4.0: shedding new light on drug metabolism," Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim, et al., "DSigDB: drug signatures database for gene set analysis," Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz, et al., "The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data," Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, "SeqMule: automated pipeline for analysis of human exome/genome sequencing data," Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, "Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses," BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav, et al., "IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples," Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow, et al., "A research roadmap for next-generation sequencing informatics," Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
genes of interest [72]. DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions [72]. SNPsea is another pathway analysis tool that requires specific SNP data [73]. SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments
The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy [74]. By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting [74]. Pharmacological databases such as PharmGKB [75], DrugBank [76], and DSigDB [77] provide the link between drugs and drug targets (variants). For example, querying one of these databases with a list of variants allows users to identify actionable targets, via enrichment analysis, for the repurposing of drugs.
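The enrichment analysis mentioned above can be illustrated with a one-sided hypergeometric test, which is one common choice and not necessarily what any of the cited databases uses. The sketch below is a minimal illustration under stated assumptions: the function names, the toy drug-target table, and the background size of 20,000 genes are all hypothetical and are not taken from PharmGKB, DrugBank, or DSigDB.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric variable X.

    population: number of background genes
    successes:  number of background genes targeted by the drug
    draws:      number of mutated genes in the query list
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def drug_enrichment(mutated_genes, drug_targets, background_size):
    """Rank drugs by how strongly their targets overlap the mutated genes."""
    mutated = set(mutated_genes)
    results = []
    for drug, targets in drug_targets.items():
        targets = set(targets)
        overlap = mutated & targets
        p_value = hypergeom_sf(len(overlap), background_size, len(targets), len(mutated))
        results.append((drug, sorted(overlap), p_value))
    # Smallest p-value first: the most enriched drug target sets lead the list.
    return sorted(results, key=lambda row: row[2])


# Toy, hand-written drug-target table (illustrative only, not database content).
TOY_DRUG_TARGETS = {
    "dabrafenib": ["BRAF"],
    "trametinib": ["MAP2K1", "MAP2K2"],
    "hypothetical_drug_x": ["EGFR", "ERBB2"],
}

if __name__ == "__main__":
    ranked = drug_enrichment(["BRAF", "CDKN2A", "TP53", "MAP2K1"], TOY_DRUG_TARGETS, 20000)
    for drug, overlap, p_value in ranked:
        print(drug, overlap, p_value)
```

In practice the p-values would need multiple-testing correction across all drug signatures, and the background should match the genes actually covered by the exome capture kit.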
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, because electronic health records (EHR) are poorly integrated with molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data [78]. New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate the computational tools and methods described in the previous sections into a single analysis workflow. Here we review three recent sequencing pipelines, SeqMule [79], Fastq2vcf [80], and IMPACT [81], that assimilate some of those tools.
SeqMule stands out in part because it supports five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) [79]. SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images that give an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses, using BWA as the alignment tool and calling variants with GATK UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer, and produces a filtered VCF after annotation with ANNOVAR and VEP [80]. Fastq2vcf can be used in a single or parallel computing environment on a variety of sequencing data.
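When a pipeline runs several variant callers, their call sets have to be reconciled somehow. The text does not describe how SeqMule or Fastq2vcf do this, so the following is only a sketch of one common option, a simple majority vote. The function name, the tuple representation of a variant, and the `min_callers` threshold are assumptions made for illustration.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` different callers.

    calls_by_caller maps a caller name to an iterable of variants, each
    represented here as a (chrom, pos, ref, alt) tuple.
    """
    votes = Counter()
    for variants in calls_by_caller.values():
        # set() ensures each caller votes at most once per variant.
        votes.update(set(variants))
    return sorted(v for v, n in votes.items() if n >= min_callers)


if __name__ == "__main__":
    # Made-up calls, for demonstration only.
    example = {
        "caller_a": [("1", 100, "A", "G"), ("1", 200, "C", "T")],
        "caller_b": [("1", 100, "A", "G")],
        "caller_c": [("1", 200, "C", "T"), ("2", 50, "G", "A")],
    }
    for variant in consensus_calls(example, min_callers=2):
        print(variant)
```

Raising `min_callers` trades sensitivity for specificity; real pipelines would also normalize indel representations before comparing calls.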
Both the SeqMule and Fastq2vcf pipelines focus on converting raw sequencing data into a filtered VCF file. The IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics [81]. The IMPACT pipeline contains four analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against the deleterious variants, and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. The IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to combined dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies [81].
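The idea of linking a filtered VCF to therapeutics can be sketched as a two-step lookup: collect the gene annotations of variants that pass filtering, then look those genes up in a drug-target table. This is not IMPACT's implementation. It assumes the VCF was already annotated with a gene symbol stored under an INFO key (the default key name `Gene.refGene` below is an assumption about the annotation output), and the drug table is a toy, hand-written example.

```python
def genes_from_vcf(lines, info_key="Gene.refGene"):
    """Collect gene symbols from the INFO column of passing VCF records."""
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if fields[6] not in ("PASS", "."):
            continue  # column 7 is FILTER; keep passing or unfiltered records
        for entry in fields[7].split(";"):  # column 8 is INFO
            key, _, value = entry.partition("=")
            if key == info_key and value and value != ".":
                genes.add(value)
    return genes


def link_to_drugs(genes, gene_to_drugs):
    """Return the subset of genes that have at least one listed drug."""
    return {g: sorted(gene_to_drugs[g]) for g in sorted(genes) if g in gene_to_drugs}


# Toy lookup table (illustrative only, not a clinical resource).
TOY_GENE_TO_DRUGS = {"BRAF": ["dabrafenib", "vemurafenib"], "MAP2K1": ["trametinib"]}

if __name__ == "__main__":
    vcf_lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "7\t140453136\t.\tA\tT\t50\tPASS\tDP=100;Gene.refGene=BRAF",
        "17\t7577120\t.\tC\tT\t12\tLowQual\tDP=8;Gene.refGene=TP53",
    ]
    print(link_to_drugs(genes_from_vcf(vcf_lines), TOY_GENE_TO_DRUGS))
```

A gene-level lookup like this ignores the specific amino acid change; a clinically meaningful link would also have to check that the variant itself is one the drug is known to act on.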
Comparing the strengths and weaknesses of these three WES pipelines, SeqMule allows the use of different alignment algorithms, whereas IMPACT and Fastq2vcf utilize only BWA as the alignment algorithm. SAMtools is the variant caller common to IMPACT, Fastq2vcf, and SeqMule. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variant detection. Fastq2vcf and IMPACT both annotate variants with ANNOVAR; Fastq2vcf also utilizes VEP, and IMPACT utilizes SIFT and PolyPhen-2 as its primary variant prediction methods. For Post-VCF analysis, the IMPACT pipeline has more options than SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity analysis, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is designed only for tumor samples, while SeqMule and Fastq2vcf are designed for any WES dataset. Users should therefore consider their analytic needs when selecting the appropriate WES data analysis pipeline for their research.
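The comparison above can be restated as a small feature table and a filter over it. The sketch below encodes only what this section states about the three pipelines; the dictionary layout, the feature labels, and the function name are hypothetical, and the table would need updating against each pipeline's current documentation before being relied on.

```python
# Features as described in this section (not checked against the tools' current releases).
PIPELINES = {
    "SeqMule": {
        "aligners": {"BWA", "Bowtie", "Bowtie 2", "SOAP2", "SNAP"},
        "callers": {"GATK", "SAMtools", "VarScan2", "FreeBayes", "SOAPsnp"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "Fastq2vcf": {
        "aligners": {"BWA"},
        "callers": {"GATK UnifiedGenotyper", "HaplotypeCaller", "SAMtools", "SNVer"},
        "post_vcf": set(),
        "tumor_only": False,
    },
    "IMPACT": {
        "aligners": {"BWA"},
        "callers": {"SAMtools"},
        "post_vcf": {"copy_number", "tumor_heterogeneity", "drug_linking"},
        "tumor_only": True,
    },
}


def candidate_pipelines(needed_post_vcf=(), tumor_sample=True, pipelines=PIPELINES):
    """List pipelines that cover the requested Post-VCF analyses for this sample type."""
    needed = set(needed_post_vcf)
    hits = []
    for name, features in pipelines.items():
        if features["tumor_only"] and not tumor_sample:
            continue  # pipeline is described as designed for tumor samples only
        if needed <= features["post_vcf"]:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(candidate_pipelines(["copy_number"], tumor_sample=True))
    print(candidate_pipelines([], tumor_sample=False))
```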
As recently discussed by Altman et al., part of the US Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine [82]. Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipelines will be needed to incorporate newer and wider sets of tools tailored to specific research questions.
5. Conclusions
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These include methods that generate VCF files from raw sequencing data as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
References
[1] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
[3] M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
[4] A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
[5] M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
[6] G. A. Van der Auwera, M. O. Carneiro, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
[7] H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
[9] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
[10] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
[11] S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
[12] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
[13] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “dbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “SomaticSniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
![Page 14: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application](https://reader034.vdocument.in/reader034/viewer/2022052019/603302c205c83b368008a5ab/html5/thumbnails/14.jpg)
14 International Journal of Genomics
precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory
Competing Interests
The authors declare that there are no competing interestsregarding the publication of this paper
Acknowledgments
The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation
References
[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010
[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016
[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016
[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010
[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011
[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013
[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009
[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010
[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009
[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012
[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010
[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008
[14] C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
[15] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
[16] R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[17] Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
[18] G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
[19] V. L. Galinsky, “YOABS: yet other aligner of biological sequences - an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
[20] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
[21] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[22] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
[23] H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
[24] S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
[25] D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
[26] A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
[27] E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
[28] G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
[29] A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome - a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
[30] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
[31] E. Karakoc, C. Alkan, B. J. O’Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
[32] A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
[33] Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
[34] K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
[35] K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
[36] P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
[37] P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
[38] L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
[39] S. T. Sherry, M.-H. Ward, M. Kholodov et al., “DbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
[40] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
[41] 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
[42] S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
[43] P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
[44] I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
[45] S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
[46] H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
[47] C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
[48] H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
[49] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
[50] D. E. Larson, C. C. Harris, K. Chen et al., “Somaticsniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
[51] J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
[52] K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
[53] C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
[54] D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
[55] K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
[56] J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
[57] A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
[58] J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
[59] V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
[60] Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
[61] D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
[62] A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
[63] J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
[64] J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
[65] H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
[66] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
[67] M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
[68] M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
[69] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
[70] D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
[71] M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
[72] E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
[73] K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
[74] M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
[75] M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
[76] V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
[77] M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
[78] E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
[79] Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
[80] X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
[81] J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
[82] R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.