review article a survey of computational tools to analyze...

17
Review Article A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data Jennifer D. Hintzsche, 1 William A. Robinson, 1,2 and Aik Choon Tan 1,2,3 1 Division of Medical Oncology, Department of Medicine, School of Medicine, Aurora, CO 80045, USA 2 University of Colorado Cancer Center, Aurora, CO 80045, USA 3 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado, Anschutz Medical Campus, Aurora, CO 80045, USA Correspondence should be addressed to Aik Choon Tan; [email protected] Received 25 May 2016; Accepted 26 October 2016 Academic Editor: Lam C. Tsoi Copyright © 2016 Jennifer D. Hintzsche et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shiſted the bottleneck in experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. ese tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make more informative decisions on selecting the best tools to analyze their WES data. 1. Introduction Recent advances in next-generation sequencing technolo- gies provide revolutionary opportunities to characterize the genomic landscapes of individuals at single base resolution for identifying actionable mutations for disease treatment and management [1, 2]. Whole Exome Sequencing (WES) is the application of the next-generation technology to deter- mine the variations in the exome, that is, all coding regions of known genes in a genome. For example, more than 85% of disease-causing mutations in Mendelian diseases are found in the exome, and WES provides an unbiased approach to detect these variants in the era of personalized and precision medicine. Next-generation sequencing technologies have shiſted the bottleneck in experimental data production to computationally intensive informatics-based data analysis. For example, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unre- lated individuals from various disease-specific and popu- lation genetic studies [3]. To gain insights in WES, novel computational algorithms and bioinformatics methods rep- resent a critical component in modern biomedical research to analyze and interpret these massive datasets. Genomic studies that employ WES have increased over the years, and new bioinformatics methods and computa- tional tools have developed to assist the analysis and interpre- tation of this data (Figure 1). e majority of WES computa- tional tools are centered on the generation of a Variant Calling Format (VCF) file from raw sequencing data. Once the VCF files have been generated, further downstream analyses can be performed by other computational methods. erefore, in this review we have classified bioinformatics methods and computational tools into Pre-VCF and Post-VCF cate- gories. Pre-VCF workflows include tools for aligning the raw sequencing reads to a reference genome, variant detection, and annotation. Post-VCF workflows include methods for somatic mutation detection, pathway analysis, copy num- ber alterations, INDEL identification, and driver prediction. Depending on the nature of the hypothesis, beyond VCF Hindawi Publishing Corporation International Journal of Genomics Volume 2016, Article ID 7983236, 16 pages http://dx.doi.org/10.1155/2016/7983236

Upload: others

Post on 06-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

Review ArticleA Survey of Computational Tools to Analyze and InterpretWhole Exome Sequencing Data

Jennifer D Hintzsche1 William A Robinson12 and Aik Choon Tan123

1Division of Medical Oncology Department of Medicine School of Medicine Aurora CO 80045 USA2University of Colorado Cancer Center Aurora CO 80045 USA3Department of Biostatistics and Informatics Colorado School of Public Health University of ColoradoAnschutz Medical Campus Aurora CO 80045 USA

Correspondence should be addressed to Aik Choon Tan aikchoontanucdenveredu

Received 25 May 2016 Accepted 26 October 2016

Academic Editor Lam C Tsoi

Copyright copy 2016 Jennifer D Hintzsche et al This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome andis becoming a standard approach in studying genetic variants in diseases Understanding the exomes of individuals at single baseresolution allows the identification of actionable mutations for disease treatment and management WES technologies have shiftedthe bottleneck in experimental data production to computationally intensive informatics-based data analysis Novel computationaltools and methods have been developed to analyze and interpret WES data Here we review some of the current tools that arebeing used to analyze WES data These tools range from the alignment of raw sequencing reads all the way to linking variants toactionable therapeutics Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make moreinformative decisions on selecting the best tools to analyze their WES data

1 Introduction

Recent advances in next-generation sequencing technolo-gies provide revolutionary opportunities to characterize thegenomic landscapes of individuals at single base resolutionfor identifying actionable mutations for disease treatmentand management [1 2] Whole Exome Sequencing (WES) isthe application of the next-generation technology to deter-mine the variations in the exome that is all coding regionsof known genes in a genome For example more than 85 ofdisease-causing mutations in Mendelian diseases are foundin the exome and WES provides an unbiased approach todetect these variants in the era of personalized and precisionmedicine Next-generation sequencing technologies haveshifted the bottleneck in experimental data production tocomputationally intensive informatics-based data analysisFor example the Exome Aggregation Consortium (ExAC)has assembled and reanalyzed WES data of 60706 unre-lated individuals from various disease-specific and popu-lation genetic studies [3] To gain insights in WES novel

computational algorithms and bioinformatics methods rep-resent a critical component in modern biomedical researchto analyze and interpret these massive datasets

Genomic studies that employ WES have increased overthe years and new bioinformatics methods and computa-tional tools have developed to assist the analysis and interpre-tation of this data (Figure 1) The majority of WES computa-tional tools are centered on the generation of aVariantCallingFormat (VCF) file from raw sequencing data Once the VCFfiles have been generated further downstream analyses canbe performed by other computational methods Thereforein this review we have classified bioinformatics methodsand computational tools into Pre-VCF and Post-VCF cate-gories Pre-VCF workflows include tools for aligning the rawsequencing reads to a reference genome variant detectionand annotation Post-VCF workflows include methods forsomatic mutation detection pathway analysis copy num-ber alterations INDEL identification and driver predictionDepending on the nature of the hypothesis beyond VCF

Hindawi Publishing CorporationInternational Journal of GenomicsVolume 2016 Article ID 7983236 16 pageshttpdxdoiorg10115520167983236

2 International Journal of GenomicsN

umbe

r of p

ublic

atio

ns in

Pub

Med

0

200

400

600

800

1000

1200

1400

2011 2012 2013 2014 2015 520162010(Year)

30 4

12426

297

49

574

108192 238

105

730

930

1255

Whole Exome Sequencing (WES) StudiesWES Computational Tools

Figure 1 Trends in Whole Exome Sequencing studies and tools byquerying PubMed (2011ndash2016)

analysis can also includemethods that link variants to clinicaldata as well as potential therapeutics (Figure 2)

Computational tools developed to align raw sequencingdata to an annotated VCF file have been well establishedMost studies tend to follow workflows associated with GATK[4ndash6] SAMtools [7] or a combination of these In generalworkflows start with aligning WES reads to a referencegenome and noting reads that vary The most commonof these variants are single nucleotide variants (SNVs) butalso include insertions deletions and rearrangements Thelocation of these variants is used to annotate them to a specificgene After annotation the SNVs found can be compared todatabases of SNVs found in other studies This allows for thedetermination of frequency of a particular SNV in a givenpopulation In some studies such as those relating to cancerrare somaticmutations are of interest However inMendelianstudies the germline mutational landscape will be of moreinterest than somatic mutations Before a final VCF file isproduced for a given sample software can be used to predictif the variant will be functionally damaging to the protein forprioritizing candidate genes for further study

Bioinformatics methods developed beyond the establish-ment of annotated VCF files are far less established In cancerresearch the most established types of beyond VCF toolsare focused on the detection of somatic mutations Howeverthere are strides being made to develop other computationaltools including pathway analysis copy number alterationINDEL identification driver mutation predictions and link-ing candidate genes to clinical data and actionable targets

Here we will review recent computational tools in theanalysis and interpretation of WES data with special focuson the applications of these methods in cancer researchWe have surveyed the current trends in next-generationsequencing analysis tools and compared their methodologyso that researches can better determine which tools are thebest for their WES study and the advancement of precisionmedicine In addition we include a list of publicly available

bioinformatics and computational tools as a reference forWES studies (Table 1)

2 Computational Tools in Pre-VCF Analyses

Alignment removal of duplicates variant calling annotationfiltration and prediction are all parts of the steps leading upto the generation of a filtered and annotated VCF file Herewe review each one of these steps as shown in Figure 2 andcompare and contrast some of the tools that can be used toperform the Pre-VCF analysis steps

21 Alignment Tools The first step in any analysis of next-generation sequencing is to align the sequencing reads to areference genomeThe twomost common reference genomesfor humans currently are hg18 and hg19 Several aligningalgorithms have been developed including but not limited toBWA [8] Bowtie 1 [9] and 2 [10] GEM [11] ELAND (Illu-mina Inc) GSNAP [12] MAQ [13] mrFAST [14] Novoalign(httpwwwnovocraftcom) SOAP 1 [15] and 2 [16] SSAHA[17] Stampy [18] and YOABS [19] Each method has its ownunique features and many papers have reviewed the differ-ences between them [20ndash22] and we will not review thesetools in depth here The three most commonly used of thesealgorithms are BWA Bowtie (1 and 2) and SOAP (1 and 2)

22 Auxiliary Tools Some auxiliary tools have been devel-oped to filter aligned reads to ensure higher quality data fordownstream analyses PCR amplification can introduceduplicate reads of paired-end reads in sequencing dataTheseduplicate reads can influence the depth of the mapped readsanddownstreamanalyses For example if a variant is detectedin duplicate reads the proportion of reads containing avariant could pass the threshold needed for variant callingthus calling a falsely positive variant Therefore removingduplicate reads is a crucial step in accurately representingthe sequencing depth during downstream analyses Severaltools have been developed to detect PCR duplicates includingPicard (httppicardsourceforgenet) FastUniq [23] andSAMtools [7] SAMtools rmdup finds reads that start andend at the same position find the read with the highestquality score and mark the rest of the duplicates for removalPicard finds identical 51015840 positions for both reads in a matepair and marks them as duplicates In contrast FastUniqtakes a de novo approach to quickly identify PCR duplicatesFastUniq imports all reads sorts them according to theirlocation and thenmarks duplicatesThis allows FastUniq notto require complete genome sequences as prerequisites Dueto the different algorithms each of these tools use these toolscan remove PCR duplicates individually or in combination

23 Methods for Single Nucleotide Variants (SNVs) CallingAfter sequences have been aligned to the reference genomethe next step is to perform variant detection in theWES dataThere are four general categories of variant calling strategiesgermline variants somatic variants copy number variationsand structural variants Multiple tools that perform oneor more of these variant calling techniques were recentlycompared to each other [24] Some common SNV calling

International Journal of Genomics 3

Pre-VCF analyses Pre-VCF analyses

Alignment tools

mrFAST

Auxiliary tools

SNV amp SV calling toolsFor example

GATK IndelMINERPindel Platypus SAMtoolsSplitread SpritesVCMM

VCF annotation tools

Database filtrationFor example

Genome Project dbSNP LOVD COSMIC

Tools for functional variants prediction For example CADD

MetaLR MetaSVM

Somatic mutationdetection toolsFor example

MuTect VarSimSomVarIUS

Driver prediction tools

Copy number alteration toolsFor example

ADTEx CONTRA EXCAVATOR ExomeCNVControl-FREEC

VarScan2 ExomeAI

Pathway resourcesFor example

For exampleFor exampleFor example

Pathway analysis toolsFor example

SNPsea DAVID

WES data analysis pipeline for example SeqMule Fastq2vcf IMPACT genomes on the cloud

TherapeuticPrediction

Biologicalinsights

precisionmedicine

amp

Wholeexome

sequencingdata

(Fastq)SNV amp SV calling

VCFVCF

annotationFunctionalprediction Somaticmutations

DriverpredictionCopy

numberalterationsPathwayanalysis

Alignment

Therapeutic resources

Genome ClinVarPharmGKBDrugBank DSigDB

For example My Cancer

For example BWABOWTIE ELAND GEM GNSAP MAQ

Novoalign SOAP SSHA STAMPY YOBAS

SomaticSniperCNV-seq SegSeq 1000

FreeBayes

CNVseeqer

FastUniqPicard SAMtools

ANNOVARMuTect SnpEffSnpSift VAT

FATHMM LRT

PolyPhen-2 SIFT VESTCHASM MutSigCVDendrix

KEGG STRING BEReX

DAPPLE

Figure 2 Whole Exome Sequencing data analysis steps Novel computational methods and tools have been developed to analyze the fullspectrum of WES data translating raw fastq files to biological insights and precision medicine

programs are GATK [4ndash6] SAMtools [7] and VCMM [25]The actual SNV calling mechanisms of GATK and SAMtoolsare very similar However the context before and afterSNV calling represents the differences between these toolsGATK assumes each sequencing error is independent whileSAMtools believes a secondary error carries more weightAfter SNV calling GATK learns from data while SAMtoolsrelies on options of the user Variant Caller with Multinomialprobabilistic Model (VCMM) is another tool developed todetect SNVs and INDELs from WES and Whole GenomeSequencing (WGS) studies using a multinomial probabilisticmodel with quality score and a strand bias filter [25] VCMMsuppressed the false-positive and false-negative variant callswhen compared to GATK and SAMtools However thenumber of variant calls was similar to previous studies Thecomparison done by the authors of VCMM demonstratedthat while all three methods call a large number of commonSNVs each tool also identifies SNVs not found by the othermethods [25] The ability of each method to call SNVs notfound by the others should be taken into account whenchoosing a SNV variant calling tool(s)

24 Methods for Structural Variants (SVs) IdentificationStructural Variants (SVs) such as insertions and deletions(INDELs) in high-throughput sequencing data aremore chal-lenging to identify than single nucleotide variants becausethey could involve an undefined number of nucleotides Themajority of WES studies follow SAMtools [7] or GATK [4ndash6]workflows which will identify INDELs in the data Howeverother software has been developed to increase the sensitivityof INDELdiscoverywhile simultaneously decreasing the falsediscovery rate

Platypus [26] was developed to find SNVs INDELs andcomplex polymorphisms using local de novo assemblyWhencompared to SAMtools and GATK Platypus had the lowestFosmid false discovery rate for both SNVs and INDELs inwhole genome sequencing of 15 samples It also had the short-est runtime of these toolsHoweverGATKand SAMtools hadlower Fosmid false discovery rates than Platypus when find-ing SNVs and INDELs inWES data [26] Therefore Platypusseems to be appropriate for whole genome sequencing butcaution should be used when utilizing this tool with WESdata

FreeBayes uses a unique approach to INDEL detectioncompared to other tools The method utilizes haplotype-based variant detection under a Bayesian statistics framework[27] This method has been used in several studies in com-bination with other approaches for the identifying of uniqueINDELs [28 29]

Pindel was one of the first programs developed to addressthe issue of unidentified large INDELs due to the short lengthof WGS reads [30] In brief after alignment of the reads tothe reference genome Pindel identifies reads where one endwasmapped and the otherwas not [30]Then Pindel searchesthe reference genome for the unmapped portion of this readover a user defined area of the genome [30] This split-read algorithm successfully identified large INDELs Othercomputational tools developed after Pindel still utilize thisalgorithm as the foundation in their methods for detectingINDELs

Splitread [31] was developed to specifically identify struc-tural variants and INDELs in WES data from 1 bp to 1Mbpbuilding on the split-read approach of Pindel [30] Thealgorithms used by SAMtools and GATK limit the size of

4 International Journal of Genomics

Table1

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Alignm

enttools

Burrow

s-Wheeler

Aligner

(BWA)

Perfo

rmshortreads

alignm

entu

singBW

Tapproach

againsta

references

geno

mea

llowingforg

apsmism

atches

httpbio-bw

asou

rceforgenet

[8]

Bowtie

(1amp2)

Perfo

rmssho

rtread

alignm

entu

singtheB

urrows-W

heele

rind

exin

ordertobe

mem

oryeffi

cientwhilestillmaintaining

analignm

ent

speedof

over

25million35

bpreadsp

erho

ur

httpbo

wtie

-bioso

urceforgen

etin

dexshtm

l[910]

ELAND

Shortreadalignerthatachievesspeed

bysplittin

greadsintoequal

leng

thsa

ndapplying

seed

templates

togu

aranteeh

itswith

only2

mism

atches

httpwwwillum

inac

om

Illum

inaInc

GEM

Shortreadaligneru

singstr

ingmatchinginste

adof

BWTto

deliver

precision

andspeed

httpalgorithm

scnagcatw

ikiTh

eGEM

library

[11]

GSN

AP

Perfo

rmssho

rtandlong

read

alignm

entdetectslon

gandshort

distance

splicingSN

Psand

iscapableo

fdetectin

gbisulfite-tr

eated

DNAform

ethylatio

nstu

dies

httpresearch-pub

genec

omgmap

[12]

MAQ

Shortreadalignerc

ompatib

lewith

Illum

ina-Solexa

andABI

SOLiD

dataperform

sung

appedalignm

entallo

wing2-3mism

atches

for

single-endreadsa

ndon

emism

atch

forp

aired-endreads

httpmaqso

urceforgen

et

[13]

mrFAST

Perfo

rmssho

rtread

alignm

entallo

wingforINDEL

supto

8bpfor

Illum

inag

enerated

dataPaired-endmapping

usingao

neend

anchored

algorithm

allowsfor

detectionof

novelinsertio

ns

httpmrfa

stsourceforgen

et

[14]

Novoalign

Alignm

entd

oneo

npaire

d-endor

single-endsequ

encesalso

capable

ofdo

ingmethylationstu

diesA

llowsfor

amism

atch

upto

50of

aread

leng

thandhasb

uilt-in

adaptera

ndbase

quality

trim

ming

httpwwwno

vocraft

comprodu

ctsno

voalign

httpwww

novocraftcom

SOAP(1amp2)

SOAP2

improved

speedby

anordero

fmagnitude

over

SOAP1

and

canalignaw

ider

ange

ofread

leng

thsa

tthe

speedof

2minutes

for

onem

illionsin

gle-endreadsu

singatwo-way

BWTalgorithm

httpsoapgenom

icso

rgcn

[1516]

SSAHA

Usesa

hashingalgorithm

tofin

dexacto

rclose

toexactm

atchingin

DNAandproteindatabasesanalogou

stodo

ingaB

LAST

search

for

each

read

httpsw

wwvectorbaseorgglossary

ssaha-sequ

ence-search-and-alignm

ent-h

ashing

-algorith

m

[17]

Stam

pyAlignm

entd

oneu

singah

ashing

algorithm

andsta

tistic

almod

elto

alignIllum

inar

eads

forg

enom

eRN

Aand

Chip

sequ

encing

allowing

fora

largen

umbero

rvariatio

nsinclu

ding

insertions

anddeletio

ns

httpwwwwelloxacukproject-s

tampy

[18]

International Journal of Genomics 5Ta

ble1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

YOABS

Usesa

0(119899)a

lgorith

mthatuses

both

hash

andtri-b

ased

metho

dsthat

aree

ffectiveinaligning

sequ

enceso

ver2

00bp

with

3tim

esless

mem

oryandtentim

esfaste

rthanSSAHA

Availableb

yrequ

estfor

noncom

mercialuse

[19]

HTS

eqPy

thon

basedpackagew

ithmanyfunctio

nsto

facilitates

everal

aspectso

fsequencingstu

dies

httpwww-hub

erem

bldeHTS

eqdocoverviewhtml

Auxiliary

tools

FastU

niq

Impo

rtssortsandidentifi

esPC

Rdu

plicates

ofshortsequences

from

sequ

encing

data

httpssou

rceforgenetprojectsfastu

niq

[23]

Picard

Picard

isas

etof

commandlin

etoo

lsform

anipulating

high

-throug

hput

sequ

encing

(HTS

)dataa

ndform

atssuchas

SAMBAMCRA

MandVC

Fhttppicardso

urceforgen

et

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

SNVandSV

calling

GAT

KVa

riant

calling

ofSN

Psandsm

allINDEL

scanalso

beused

onno

nhum

anandno

ndiploid

organism

shttpsw

wwbroadinstituteo

rggatk

[4ndash6

]

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

VCMM

Detectio

nof

SNVsa

ndIN

DEL

susin

gthem

ultin

omialprobabilistic

metho

din

WES

andWGSdata

httpem

usrcr

ikenjpVCM

M

[25]

FreeBa

yes

Detectio

nof

SNPsM

NPsINDEL

sandstr

ucturalvariants(SV

s)fro

msequ

encing

alignm

entsusingBa

yesia

nsta

tisticalmetho

ds

httpsgith

ubco

mekgfreebayes

[27]

indelM

INER

Splitread

algorithm

toidentifybreakp

oint

inIN

DEL

sfrom

paire

d-endsequ

encing

data

httpsgith

ubco

maakroshin

delM

INER

[32]

Pind

elDetectio

nof

INDEL

susin

gap

attern

grow

thapproach

with

anchor

pointsto

providen

ucleotide-levelresolution

httpgm

tgenom

ewustledupackagespindel

[30]

Platypus

Detectio

nof

SNPsM

NPsINDEL

sreplacem

ents

andstructural

varia

nts(SV

s)fro

msequ

encing

alignm

entsusinglocalrealignm

ent

andlocalassem

blyto

achieveh

ighspecificityandsensitivity

httpwwwwelloxacukplatypus

[26]

Splitread

Detectio

nof

INDEL

slessthan50

bplong

from

WES

orWGSdata

usingas

plit-read

algorithm

httpsplitreadso

urceforgen

et

[31]

Sprites

Detectio

nof

INDEL

sisd

oneu

singas

plit-read

andsoft-clipp

ing

approach

thatisespeciallysensitive

indatasetswith

lowcoverage

httpsgith

ubco

mzhang

zhensprites

[33]

VCFannotatio

n

ANNOVA

RProvides

up-to

-datea

nnotationof

VCFfiles

bygeneregion

and

filtersfro

mseveralother

databases

httpanno

varo

penb

ioinform

aticso

rg

[34]

MuT

ect

Postp

rocesses

varia

ntstoeliminatea

rtifa

ctsfrom

hybrid

capture

shortreadalignm

entandnext-generationsequ

encing

httpwwwbroadinstituteo

rgcancercgamutect

[35]

SnpE

ffUses3

800

0geno

mes

topredictand

anno

tatethee

ffectso

fvariants

ongenes

httpsnpeffsourceforgen

et

[36]

SnpSift

ToolstomanipulateV

CFfiles

inclu

ding

filterin

ganno

tatio

ncase

controls

transition

andtransversio

nratesa

ndmore

httpsnpeffsourceforgen

etSnp

Sifthtml

[37]

VAT

Ann

otationof

varia

ntsb

yfunctio

nalityin

acloud

compu

ting

environm

ent

httpvatgerste

inlaborg

[38]

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

2 International Journal of GenomicsN

umbe

r of p

ublic

atio

ns in

Pub

Med

0

200

400

600

800

1000

1200

1400

2011 2012 2013 2014 2015 520162010(Year)

30 4

12426

297

49

574

108192 238

105

730

930

1255

Whole Exome Sequencing (WES) StudiesWES Computational Tools

Figure 1 Trends in Whole Exome Sequencing studies and tools byquerying PubMed (2011ndash2016)

analysis can also includemethods that link variants to clinicaldata as well as potential therapeutics (Figure 2)

Computational tools developed to align raw sequencingdata to an annotated VCF file have been well establishedMost studies tend to follow workflows associated with GATK[4ndash6] SAMtools [7] or a combination of these In generalworkflows start with aligning WES reads to a referencegenome and noting reads that vary The most commonof these variants are single nucleotide variants (SNVs) butalso include insertions deletions and rearrangements Thelocation of these variants is used to annotate them to a specificgene After annotation the SNVs found can be compared todatabases of SNVs found in other studies This allows for thedetermination of frequency of a particular SNV in a givenpopulation In some studies such as those relating to cancerrare somaticmutations are of interest However inMendelianstudies the germline mutational landscape will be of moreinterest than somatic mutations Before a final VCF file isproduced for a given sample software can be used to predictif the variant will be functionally damaging to the protein forprioritizing candidate genes for further study

Bioinformatics methods developed beyond the establish-ment of annotated VCF files are far less established In cancerresearch the most established types of beyond VCF toolsare focused on the detection of somatic mutations Howeverthere are strides being made to develop other computationaltools including pathway analysis copy number alterationINDEL identification driver mutation predictions and link-ing candidate genes to clinical data and actionable targets

Here we will review recent computational tools in theanalysis and interpretation of WES data with special focuson the applications of these methods in cancer researchWe have surveyed the current trends in next-generationsequencing analysis tools and compared their methodologyso that researches can better determine which tools are thebest for their WES study and the advancement of precisionmedicine In addition we include a list of publicly available

bioinformatics and computational tools as a reference forWES studies (Table 1)

2 Computational Tools in Pre-VCF Analyses

Alignment removal of duplicates variant calling annotationfiltration and prediction are all parts of the steps leading upto the generation of a filtered and annotated VCF file Herewe review each one of these steps as shown in Figure 2 andcompare and contrast some of the tools that can be used toperform the Pre-VCF analysis steps

21 Alignment Tools The first step in any analysis of next-generation sequencing is to align the sequencing reads to areference genomeThe twomost common reference genomesfor humans currently are hg18 and hg19 Several aligningalgorithms have been developed including but not limited toBWA [8] Bowtie 1 [9] and 2 [10] GEM [11] ELAND (Illu-mina Inc) GSNAP [12] MAQ [13] mrFAST [14] Novoalign(httpwwwnovocraftcom) SOAP 1 [15] and 2 [16] SSAHA[17] Stampy [18] and YOABS [19] Each method has its ownunique features and many papers have reviewed the differ-ences between them [20ndash22] and we will not review thesetools in depth here The three most commonly used of thesealgorithms are BWA Bowtie (1 and 2) and SOAP (1 and 2)

22 Auxiliary Tools Some auxiliary tools have been devel-oped to filter aligned reads to ensure higher quality data fordownstream analyses PCR amplification can introduceduplicate reads of paired-end reads in sequencing dataTheseduplicate reads can influence the depth of the mapped readsanddownstreamanalyses For example if a variant is detectedin duplicate reads the proportion of reads containing avariant could pass the threshold needed for variant callingthus calling a falsely positive variant Therefore removingduplicate reads is a crucial step in accurately representingthe sequencing depth during downstream analyses Severaltools have been developed to detect PCR duplicates includingPicard (httppicardsourceforgenet) FastUniq [23] andSAMtools [7] SAMtools rmdup finds reads that start andend at the same position find the read with the highestquality score and mark the rest of the duplicates for removalPicard finds identical 51015840 positions for both reads in a matepair and marks them as duplicates In contrast FastUniqtakes a de novo approach to quickly identify PCR duplicatesFastUniq imports all reads sorts them according to theirlocation and thenmarks duplicatesThis allows FastUniq notto require complete genome sequences as prerequisites Dueto the different algorithms each of these tools use these toolscan remove PCR duplicates individually or in combination

23 Methods for Single Nucleotide Variants (SNVs) CallingAfter sequences have been aligned to the reference genomethe next step is to perform variant detection in theWES dataThere are four general categories of variant calling strategiesgermline variants somatic variants copy number variationsand structural variants Multiple tools that perform oneor more of these variant calling techniques were recentlycompared to each other [24] Some common SNV calling

International Journal of Genomics 3

Pre-VCF analyses Pre-VCF analyses

Alignment tools

mrFAST

Auxiliary tools

SNV amp SV calling toolsFor example

GATK IndelMINERPindel Platypus SAMtoolsSplitread SpritesVCMM

VCF annotation tools

Database filtrationFor example

Genome Project dbSNP LOVD COSMIC

Tools for functional variants prediction For example CADD

MetaLR MetaSVM

Somatic mutationdetection toolsFor example

MuTect VarSimSomVarIUS

Driver prediction tools

Copy number alteration toolsFor example

ADTEx CONTRA EXCAVATOR ExomeCNVControl-FREEC

VarScan2 ExomeAI

Pathway resourcesFor example

For exampleFor exampleFor example

Pathway analysis toolsFor example

SNPsea DAVID

WES data analysis pipeline for example SeqMule Fastq2vcf IMPACT genomes on the cloud

TherapeuticPrediction

Biologicalinsights

precisionmedicine

amp

Wholeexome

sequencingdata

(Fastq)SNV amp SV calling

VCFVCF

annotationFunctionalprediction Somaticmutations

DriverpredictionCopy

numberalterationsPathwayanalysis

Alignment

Therapeutic resources

Genome ClinVarPharmGKBDrugBank DSigDB

For example My Cancer

For example BWABOWTIE ELAND GEM GNSAP MAQ

Novoalign SOAP SSHA STAMPY YOBAS

SomaticSniperCNV-seq SegSeq 1000

FreeBayes

CNVseeqer

FastUniqPicard SAMtools

ANNOVARMuTect SnpEffSnpSift VAT

FATHMM LRT

PolyPhen-2 SIFT VESTCHASM MutSigCVDendrix

KEGG STRING BEReX

DAPPLE

Figure 2 Whole Exome Sequencing data analysis steps Novel computational methods and tools have been developed to analyze the fullspectrum of WES data translating raw fastq files to biological insights and precision medicine

programs are GATK [4ndash6] SAMtools [7] and VCMM [25]The actual SNV calling mechanisms of GATK and SAMtoolsare very similar However the context before and afterSNV calling represents the differences between these toolsGATK assumes each sequencing error is independent whileSAMtools believes a secondary error carries more weightAfter SNV calling GATK learns from data while SAMtoolsrelies on options of the user Variant Caller with Multinomialprobabilistic Model (VCMM) is another tool developed todetect SNVs and INDELs from WES and Whole GenomeSequencing (WGS) studies using a multinomial probabilisticmodel with quality score and a strand bias filter [25] VCMMsuppressed the false-positive and false-negative variant callswhen compared to GATK and SAMtools However thenumber of variant calls was similar to previous studies Thecomparison done by the authors of VCMM demonstratedthat while all three methods call a large number of commonSNVs each tool also identifies SNVs not found by the othermethods [25] The ability of each method to call SNVs notfound by the others should be taken into account whenchoosing a SNV variant calling tool(s)

24 Methods for Structural Variants (SVs) IdentificationStructural Variants (SVs) such as insertions and deletions(INDELs) in high-throughput sequencing data aremore chal-lenging to identify than single nucleotide variants becausethey could involve an undefined number of nucleotides Themajority of WES studies follow SAMtools [7] or GATK [4ndash6]workflows which will identify INDELs in the data Howeverother software has been developed to increase the sensitivityof INDELdiscoverywhile simultaneously decreasing the falsediscovery rate

Platypus [26] was developed to find SNVs INDELs andcomplex polymorphisms using local de novo assemblyWhencompared to SAMtools and GATK Platypus had the lowestFosmid false discovery rate for both SNVs and INDELs inwhole genome sequencing of 15 samples It also had the short-est runtime of these toolsHoweverGATKand SAMtools hadlower Fosmid false discovery rates than Platypus when find-ing SNVs and INDELs inWES data [26] Therefore Platypusseems to be appropriate for whole genome sequencing butcaution should be used when utilizing this tool with WESdata

FreeBayes uses a unique approach to INDEL detectioncompared to other tools The method utilizes haplotype-based variant detection under a Bayesian statistics framework[27] This method has been used in several studies in com-bination with other approaches for the identifying of uniqueINDELs [28 29]

Pindel was one of the first programs developed to addressthe issue of unidentified large INDELs due to the short lengthof WGS reads [30] In brief after alignment of the reads tothe reference genome Pindel identifies reads where one endwasmapped and the otherwas not [30]Then Pindel searchesthe reference genome for the unmapped portion of this readover a user defined area of the genome [30] This split-read algorithm successfully identified large INDELs Othercomputational tools developed after Pindel still utilize thisalgorithm as the foundation in their methods for detectingINDELs

Splitread [31] was developed to specifically identify struc-tural variants and INDELs in WES data from 1 bp to 1Mbpbuilding on the split-read approach of Pindel [30] Thealgorithms used by SAMtools and GATK limit the size of

4 International Journal of Genomics

Table1

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Alignm

enttools

Burrow

s-Wheeler

Aligner

(BWA)

Perfo

rmshortreads

alignm

entu

singBW

Tapproach

againsta

references

geno

mea

llowingforg

apsmism

atches

httpbio-bw

asou

rceforgenet

[8]

Bowtie

(1amp2)

Perfo

rmssho

rtread

alignm

entu

singtheB

urrows-W

heele

rind

exin

ordertobe

mem

oryeffi

cientwhilestillmaintaining

analignm

ent

speedof

over

25million35

bpreadsp

erho

ur

httpbo

wtie

-bioso

urceforgen

etin

dexshtm

l[910]

ELAND

Shortreadalignerthatachievesspeed

bysplittin

greadsintoequal

leng

thsa

ndapplying

seed

templates

togu

aranteeh

itswith

only2

mism

atches

httpwwwillum

inac

om

Illum

inaInc

GEM

Shortreadaligneru

singstr

ingmatchinginste

adof

BWTto

deliver

precision

andspeed

httpalgorithm

scnagcatw

ikiTh

eGEM

library

[11]

GSN

AP

Perfo

rmssho

rtandlong

read

alignm

entdetectslon

gandshort

distance

splicingSN

Psand

iscapableo

fdetectin

gbisulfite-tr

eated

DNAform

ethylatio

nstu

dies

httpresearch-pub

genec

omgmap

[12]

MAQ

Shortreadalignerc

ompatib

lewith

Illum

ina-Solexa

andABI

SOLiD

dataperform

sung

appedalignm

entallo

wing2-3mism

atches

for

single-endreadsa

ndon

emism

atch

forp

aired-endreads

httpmaqso

urceforgen

et

[13]

mrFAST

Perfo

rmssho

rtread

alignm

entallo

wingforINDEL

supto

8bpfor

Illum

inag

enerated

dataPaired-endmapping

usingao

neend

anchored

algorithm

allowsfor

detectionof

novelinsertio

ns

httpmrfa

stsourceforgen

et

[14]

Novoalign

Alignm

entd

oneo

npaire

d-endor

single-endsequ

encesalso

capable

ofdo

ingmethylationstu

diesA

llowsfor

amism

atch

upto

50of

aread

leng

thandhasb

uilt-in

adaptera

ndbase

quality

trim

ming

httpwwwno

vocraft

comprodu

ctsno

voalign

httpwww

novocraftcom

SOAP(1amp2)

SOAP2

improved

speedby

anordero

fmagnitude

over

SOAP1

and

canalignaw

ider

ange

ofread

leng

thsa

tthe

speedof

2minutes

for

onem

illionsin

gle-endreadsu

singatwo-way

BWTalgorithm

httpsoapgenom

icso

rgcn

[1516]

SSAHA

Usesa

hashingalgorithm

tofin

dexacto

rclose

toexactm

atchingin

DNAandproteindatabasesanalogou

stodo

ingaB

LAST

search

for

each

read

httpsw

wwvectorbaseorgglossary

ssaha-sequ

ence-search-and-alignm

ent-h

ashing

-algorith

m

[17]

Stam

pyAlignm

entd

oneu

singah

ashing

algorithm

andsta

tistic

almod

elto

alignIllum

inar

eads

forg

enom

eRN

Aand

Chip

sequ

encing

allowing

fora

largen

umbero

rvariatio

nsinclu

ding

insertions

anddeletio

ns

httpwwwwelloxacukproject-s

tampy

[18]

International Journal of Genomics 5Ta

ble1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

YOABS

Usesa

0(119899)a

lgorith

mthatuses

both

hash

andtri-b

ased

metho

dsthat

aree

ffectiveinaligning

sequ

enceso

ver2

00bp

with

3tim

esless

mem

oryandtentim

esfaste

rthanSSAHA

Availableb

yrequ

estfor

noncom

mercialuse

[19]

HTS

eqPy

thon

basedpackagew

ithmanyfunctio

nsto

facilitates

everal

aspectso

fsequencingstu

dies

httpwww-hub

erem

bldeHTS

eqdocoverviewhtml

Auxiliary

tools

FastU

niq

Impo

rtssortsandidentifi

esPC

Rdu

plicates

ofshortsequences

from

sequ

encing

data

httpssou

rceforgenetprojectsfastu

niq

[23]

Picard

Picard

isas

etof

commandlin

etoo

lsform

anipulating

high

-throug

hput

sequ

encing

(HTS

)dataa

ndform

atssuchas

SAMBAMCRA

MandVC

Fhttppicardso

urceforgen

et

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

SNVandSV

calling

GAT

KVa

riant

calling

ofSN

Psandsm

allINDEL

scanalso

beused

onno

nhum

anandno

ndiploid

organism

shttpsw

wwbroadinstituteo

rggatk

[4ndash6

]

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

VCMM

Detectio

nof

SNVsa

ndIN

DEL

susin

gthem

ultin

omialprobabilistic

metho

din

WES

andWGSdata

httpem

usrcr

ikenjpVCM

M

[25]

FreeBa

yes

Detectio

nof

SNPsM

NPsINDEL

sandstr

ucturalvariants(SV

s)fro

msequ

encing

alignm

entsusingBa

yesia

nsta

tisticalmetho

ds

httpsgith

ubco

mekgfreebayes

[27]

indelM

INER

Splitread

algorithm

toidentifybreakp

oint

inIN

DEL

sfrom

paire

d-endsequ

encing

data

httpsgith

ubco

maakroshin

delM

INER

[32]

Pind

elDetectio

nof

INDEL

susin

gap

attern

grow

thapproach

with

anchor

pointsto

providen

ucleotide-levelresolution

httpgm

tgenom

ewustledupackagespindel

[30]

Platypus

Detectio

nof

SNPsM

NPsINDEL

sreplacem

ents

andstructural

varia

nts(SV

s)fro

msequ

encing

alignm

entsusinglocalrealignm

ent

andlocalassem

blyto

achieveh

ighspecificityandsensitivity

httpwwwwelloxacukplatypus

[26]

Splitread

Detectio

nof

INDEL

slessthan50

bplong

from

WES

orWGSdata

usingas

plit-read

algorithm

httpsplitreadso

urceforgen

et

[31]

Sprites

Detectio

nof

INDEL

sisd

oneu

singas

plit-read

andsoft-clipp

ing

approach

thatisespeciallysensitive

indatasetswith

lowcoverage

httpsgith

ubco

mzhang

zhensprites

[33]

VCFannotatio

n

ANNOVA

RProvides

up-to

-datea

nnotationof

VCFfiles

bygeneregion

and

filtersfro

mseveralother

databases

httpanno

varo

penb

ioinform

aticso

rg

[34]

MuT

ect

Postp

rocesses

varia

ntstoeliminatea

rtifa

ctsfrom

hybrid

capture

shortreadalignm

entandnext-generationsequ

encing

httpwwwbroadinstituteo

rgcancercgamutect

[35]

SnpE

ffUses3

800

0geno

mes

topredictand

anno

tatethee

ffectso

fvariants

ongenes

httpsnpeffsourceforgen

et

[36]

SnpSift

ToolstomanipulateV

CFfiles

inclu

ding

filterin

ganno

tatio

ncase

controls

transition

andtransversio

nratesa

ndmore

httpsnpeffsourceforgen

etSnp

Sifthtml

[37]

VAT

Ann

otationof

varia

ntsb

yfunctio

nalityin

acloud

compu

ting

environm

ent

httpvatgerste

inlaborg

[38]

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 3

Pre-VCF analyses Pre-VCF analyses

Alignment tools

mrFAST

Auxiliary tools

SNV amp SV calling toolsFor example

GATK IndelMINERPindel Platypus SAMtoolsSplitread SpritesVCMM

VCF annotation tools

Database filtrationFor example

Genome Project dbSNP LOVD COSMIC

Tools for functional variants prediction For example CADD

MetaLR MetaSVM

Somatic mutationdetection toolsFor example

MuTect VarSimSomVarIUS

Driver prediction tools

Copy number alteration toolsFor example

ADTEx CONTRA EXCAVATOR ExomeCNVControl-FREEC

VarScan2 ExomeAI

Pathway resourcesFor example

For exampleFor exampleFor example

Pathway analysis toolsFor example

SNPsea DAVID

WES data analysis pipeline for example SeqMule Fastq2vcf IMPACT genomes on the cloud

TherapeuticPrediction

Biologicalinsights

precisionmedicine

amp

Wholeexome

sequencingdata

(Fastq)SNV amp SV calling

VCFVCF

annotationFunctionalprediction Somaticmutations

DriverpredictionCopy

numberalterationsPathwayanalysis

Alignment

Therapeutic resources

Genome ClinVarPharmGKBDrugBank DSigDB

For example My Cancer

For example BWABOWTIE ELAND GEM GNSAP MAQ

Novoalign SOAP SSHA STAMPY YOBAS

SomaticSniperCNV-seq SegSeq 1000

FreeBayes

CNVseeqer

FastUniqPicard SAMtools

ANNOVARMuTect SnpEffSnpSift VAT

FATHMM LRT

PolyPhen-2 SIFT VESTCHASM MutSigCVDendrix

KEGG STRING BEReX

DAPPLE

Figure 2 Whole Exome Sequencing data analysis steps Novel computational methods and tools have been developed to analyze the fullspectrum of WES data translating raw fastq files to biological insights and precision medicine

programs are GATK [4ndash6] SAMtools [7] and VCMM [25]The actual SNV calling mechanisms of GATK and SAMtoolsare very similar However the context before and afterSNV calling represents the differences between these toolsGATK assumes each sequencing error is independent whileSAMtools believes a secondary error carries more weightAfter SNV calling GATK learns from data while SAMtoolsrelies on options of the user Variant Caller with Multinomialprobabilistic Model (VCMM) is another tool developed todetect SNVs and INDELs from WES and Whole GenomeSequencing (WGS) studies using a multinomial probabilisticmodel with quality score and a strand bias filter [25] VCMMsuppressed the false-positive and false-negative variant callswhen compared to GATK and SAMtools However thenumber of variant calls was similar to previous studies Thecomparison done by the authors of VCMM demonstratedthat while all three methods call a large number of commonSNVs each tool also identifies SNVs not found by the othermethods [25] The ability of each method to call SNVs notfound by the others should be taken into account whenchoosing a SNV variant calling tool(s)

24 Methods for Structural Variants (SVs) IdentificationStructural Variants (SVs) such as insertions and deletions(INDELs) in high-throughput sequencing data aremore chal-lenging to identify than single nucleotide variants becausethey could involve an undefined number of nucleotides Themajority of WES studies follow SAMtools [7] or GATK [4ndash6]workflows which will identify INDELs in the data Howeverother software has been developed to increase the sensitivityof INDELdiscoverywhile simultaneously decreasing the falsediscovery rate

Platypus [26] was developed to find SNVs INDELs andcomplex polymorphisms using local de novo assemblyWhencompared to SAMtools and GATK Platypus had the lowestFosmid false discovery rate for both SNVs and INDELs inwhole genome sequencing of 15 samples It also had the short-est runtime of these toolsHoweverGATKand SAMtools hadlower Fosmid false discovery rates than Platypus when find-ing SNVs and INDELs inWES data [26] Therefore Platypusseems to be appropriate for whole genome sequencing butcaution should be used when utilizing this tool with WESdata

FreeBayes uses a unique approach to INDEL detectioncompared to other tools The method utilizes haplotype-based variant detection under a Bayesian statistics framework[27] This method has been used in several studies in com-bination with other approaches for the identifying of uniqueINDELs [28 29]

Pindel was one of the first programs developed to addressthe issue of unidentified large INDELs due to the short lengthof WGS reads [30] In brief after alignment of the reads tothe reference genome Pindel identifies reads where one endwasmapped and the otherwas not [30]Then Pindel searchesthe reference genome for the unmapped portion of this readover a user defined area of the genome [30] This split-read algorithm successfully identified large INDELs Othercomputational tools developed after Pindel still utilize thisalgorithm as the foundation in their methods for detectingINDELs

Splitread [31] was developed to specifically identify struc-tural variants and INDELs in WES data from 1 bp to 1Mbpbuilding on the split-read approach of Pindel [30] Thealgorithms used by SAMtools and GATK limit the size of

4 International Journal of Genomics

Table1

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Alignm

enttools

Burrow

s-Wheeler

Aligner

(BWA)

Perfo

rmshortreads

alignm

entu

singBW

Tapproach

againsta

references

geno

mea

llowingforg

apsmism

atches

httpbio-bw

asou

rceforgenet

[8]

Bowtie

(1amp2)

Perfo

rmssho

rtread

alignm

entu

singtheB

urrows-W

heele

rind

exin

ordertobe

mem

oryeffi

cientwhilestillmaintaining

analignm

ent

speedof

over

25million35

bpreadsp

erho

ur

httpbo

wtie

-bioso

urceforgen

etin

dexshtm

l[910]

ELAND

Shortreadalignerthatachievesspeed

bysplittin

greadsintoequal

leng

thsa

ndapplying

seed

templates

togu

aranteeh

itswith

only2

mism

atches

httpwwwillum

inac

om

Illum

inaInc

GEM

Shortreadaligneru

singstr

ingmatchinginste

adof

BWTto

deliver

precision

andspeed

httpalgorithm

scnagcatw

ikiTh

eGEM

library

[11]

GSN

AP

Perfo

rmssho

rtandlong

read

alignm

entdetectslon

gandshort

distance

splicingSN

Psand

iscapableo

fdetectin

gbisulfite-tr

eated

DNAform

ethylatio

nstu

dies

httpresearch-pub

genec

omgmap

[12]

MAQ

Shortreadalignerc

ompatib

lewith

Illum

ina-Solexa

andABI

SOLiD

dataperform

sung

appedalignm

entallo

wing2-3mism

atches

for

single-endreadsa

ndon

emism

atch

forp

aired-endreads

httpmaqso

urceforgen

et

[13]

mrFAST

Perfo

rmssho

rtread

alignm

entallo

wingforINDEL

supto

8bpfor

Illum

inag

enerated

dataPaired-endmapping

usingao

neend

anchored

algorithm

allowsfor

detectionof

novelinsertio

ns

httpmrfa

stsourceforgen

et

[14]

Novoalign

Alignm

entd

oneo

npaire

d-endor

single-endsequ

encesalso

capable

ofdo

ingmethylationstu

diesA

llowsfor

amism

atch

upto

50of

aread

leng

thandhasb

uilt-in

adaptera

ndbase

quality

trim

ming

httpwwwno

vocraft

comprodu

ctsno

voalign

httpwww

novocraftcom

SOAP(1amp2)

SOAP2

improved

speedby

anordero

fmagnitude

over

SOAP1

and

canalignaw

ider

ange

ofread

leng

thsa

tthe

speedof

2minutes

for

onem

illionsin

gle-endreadsu

singatwo-way

BWTalgorithm

httpsoapgenom

icso

rgcn

[1516]

SSAHA

Usesa

hashingalgorithm

tofin

dexacto

rclose

toexactm

atchingin

DNAandproteindatabasesanalogou

stodo

ingaB

LAST

search

for

each

read

httpsw

wwvectorbaseorgglossary

ssaha-sequ

ence-search-and-alignm

ent-h

ashing

-algorith

m

[17]

Stam

pyAlignm

entd

oneu

singah

ashing

algorithm

andsta

tistic

almod

elto

alignIllum

inar

eads

forg

enom

eRN

Aand

Chip

sequ

encing

allowing

fora

largen

umbero

rvariatio

nsinclu

ding

insertions

anddeletio

ns

httpwwwwelloxacukproject-s

tampy

[18]

International Journal of Genomics 5Ta

ble1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

YOABS

Usesa

0(119899)a

lgorith

mthatuses

both

hash

andtri-b

ased

metho

dsthat

aree

ffectiveinaligning

sequ

enceso

ver2

00bp

with

3tim

esless

mem

oryandtentim

esfaste

rthanSSAHA

Availableb

yrequ

estfor

noncom

mercialuse

[19]

HTS

eqPy

thon

basedpackagew

ithmanyfunctio

nsto

facilitates

everal

aspectso

fsequencingstu

dies

httpwww-hub

erem

bldeHTS

eqdocoverviewhtml

Auxiliary

tools

FastU

niq

Impo

rtssortsandidentifi

esPC

Rdu

plicates

ofshortsequences

from

sequ

encing

data

httpssou

rceforgenetprojectsfastu

niq

[23]

Picard

Picard

isas

etof

commandlin

etoo

lsform

anipulating

high

-throug

hput

sequ

encing

(HTS

)dataa

ndform

atssuchas

SAMBAMCRA

MandVC

Fhttppicardso

urceforgen

et

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

SNVandSV

calling

GAT

KVa

riant

calling

ofSN

Psandsm

allINDEL

scanalso

beused

onno

nhum

anandno

ndiploid

organism

shttpsw

wwbroadinstituteo

rggatk

[4ndash6

]

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

VCMM

Detectio

nof

SNVsa

ndIN

DEL

susin

gthem

ultin

omialprobabilistic

metho

din

WES

andWGSdata

httpem

usrcr

ikenjpVCM

M

[25]

FreeBa

yes

Detectio

nof

SNPsM

NPsINDEL

sandstr

ucturalvariants(SV

s)fro

msequ

encing

alignm

entsusingBa

yesia

nsta

tisticalmetho

ds

httpsgith

ubco

mekgfreebayes

[27]

indelM

INER

Splitread

algorithm

toidentifybreakp

oint

inIN

DEL

sfrom

paire

d-endsequ

encing

data

httpsgith

ubco

maakroshin

delM

INER

[32]

Pind

elDetectio

nof

INDEL

susin

gap

attern

grow

thapproach

with

anchor

pointsto

providen

ucleotide-levelresolution

httpgm

tgenom

ewustledupackagespindel

[30]

Platypus

Detectio

nof

SNPsM

NPsINDEL

sreplacem

ents

andstructural

varia

nts(SV

s)fro

msequ

encing

alignm

entsusinglocalrealignm

ent

andlocalassem

blyto

achieveh

ighspecificityandsensitivity

httpwwwwelloxacukplatypus

[26]

Splitread

Detectio

nof

INDEL

slessthan50

bplong

from

WES

orWGSdata

usingas

plit-read

algorithm

httpsplitreadso

urceforgen

et

[31]

Sprites

Detectio

nof

INDEL

sisd

oneu

singas

plit-read

andsoft-clipp

ing

approach

thatisespeciallysensitive

indatasetswith

lowcoverage

httpsgith

ubco

mzhang

zhensprites

[33]

VCFannotatio

n

ANNOVA

RProvides

up-to

-datea

nnotationof

VCFfiles

bygeneregion

and

filtersfro

mseveralother

databases

httpanno

varo

penb

ioinform

aticso

rg

[34]

MuT

ect

Postp

rocesses

varia

ntstoeliminatea

rtifa

ctsfrom

hybrid

capture

shortreadalignm

entandnext-generationsequ

encing

httpwwwbroadinstituteo

rgcancercgamutect

[35]

SnpE

ffUses3

800

0geno

mes

topredictand

anno

tatethee

ffectso

fvariants

ongenes

httpsnpeffsourceforgen

et

[36]

SnpSift

ToolstomanipulateV

CFfiles

inclu

ding

filterin

ganno

tatio

ncase

controls

transition

andtransversio

nratesa

ndmore

httpsnpeffsourceforgen

etSnp

Sifthtml

[37]

VAT

Ann

otationof

varia

ntsb

yfunctio

nalityin

acloud

compu

ting

environm

ent

httpvatgerste

inlaborg

[38]

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

4 International Journal of Genomics

Table1

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Alignm

enttools

Burrow

s-Wheeler

Aligner

(BWA)

Perfo

rmshortreads

alignm

entu

singBW

Tapproach

againsta

references

geno

mea

llowingforg

apsmism

atches

httpbio-bw

asou

rceforgenet

[8]

Bowtie

(1amp2)

Perfo

rmssho

rtread

alignm

entu

singtheB

urrows-W

heele

rind

exin

ordertobe

mem

oryeffi

cientwhilestillmaintaining

analignm

ent

speedof

over

25million35

bpreadsp

erho

ur

httpbo

wtie

-bioso

urceforgen

etin

dexshtm

l[910]

ELAND

Shortreadalignerthatachievesspeed

bysplittin

greadsintoequal

leng

thsa

ndapplying

seed

templates

togu

aranteeh

itswith

only2

mism

atches

httpwwwillum

inac

om

Illum

inaInc

GEM

Shortreadaligneru

singstr

ingmatchinginste

adof

BWTto

deliver

precision

andspeed

httpalgorithm

scnagcatw

ikiTh

eGEM

library

[11]

GSN

AP

Perfo

rmssho

rtandlong

read

alignm

entdetectslon

gandshort

distance

splicingSN

Psand

iscapableo

fdetectin

gbisulfite-tr

eated

DNAform

ethylatio

nstu

dies

httpresearch-pub

genec

omgmap

[12]

MAQ

Shortreadalignerc

ompatib

lewith

Illum

ina-Solexa

andABI

SOLiD

dataperform

sung

appedalignm

entallo

wing2-3mism

atches

for

single-endreadsa

ndon

emism

atch

forp

aired-endreads

httpmaqso

urceforgen

et

[13]

mrFAST

Perfo

rmssho

rtread

alignm

entallo

wingforINDEL

supto

8bpfor

Illum

inag

enerated

dataPaired-endmapping

usingao

neend

anchored

algorithm

allowsfor

detectionof

novelinsertio

ns

httpmrfa

stsourceforgen

et

[14]

Novoalign

Alignm

entd

oneo

npaire

d-endor

single-endsequ

encesalso

capable

ofdo

ingmethylationstu

diesA

llowsfor

amism

atch

upto

50of

aread

leng

thandhasb

uilt-in

adaptera

ndbase

quality

trim

ming

httpwwwno

vocraft

comprodu

ctsno

voalign

httpwww

novocraftcom

SOAP(1amp2)

SOAP2

improved

speedby

anordero

fmagnitude

over

SOAP1

and

canalignaw

ider

ange

ofread

leng

thsa

tthe

speedof

2minutes

for

onem

illionsin

gle-endreadsu

singatwo-way

BWTalgorithm

httpsoapgenom

icso

rgcn

[1516]

SSAHA

Usesa

hashingalgorithm

tofin

dexacto

rclose

toexactm

atchingin

DNAandproteindatabasesanalogou

stodo

ingaB

LAST

search

for

each

read

httpsw

wwvectorbaseorgglossary

ssaha-sequ

ence-search-and-alignm

ent-h

ashing

-algorith

m

[17]

Stam

pyAlignm

entd

oneu

singah

ashing

algorithm

andsta

tistic

almod

elto

alignIllum

inar

eads

forg

enom

eRN

Aand

Chip

sequ

encing

allowing

fora

largen

umbero

rvariatio

nsinclu

ding

insertions

anddeletio

ns

httpwwwwelloxacukproject-s

tampy

[18]

International Journal of Genomics 5Ta

ble1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

YOABS

Usesa

0(119899)a

lgorith

mthatuses

both

hash

andtri-b

ased

metho

dsthat

aree

ffectiveinaligning

sequ

enceso

ver2

00bp

with

3tim

esless

mem

oryandtentim

esfaste

rthanSSAHA

Availableb

yrequ

estfor

noncom

mercialuse

[19]

HTS

eqPy

thon

basedpackagew

ithmanyfunctio

nsto

facilitates

everal

aspectso

fsequencingstu

dies

httpwww-hub

erem

bldeHTS

eqdocoverviewhtml

Auxiliary

tools

FastU

niq

Impo

rtssortsandidentifi

esPC

Rdu

plicates

ofshortsequences

from

sequ

encing

data

httpssou

rceforgenetprojectsfastu

niq

[23]

Picard

Picard

isas

etof

commandlin

etoo

lsform

anipulating

high

-throug

hput

sequ

encing

(HTS

)dataa

ndform

atssuchas

SAMBAMCRA

MandVC

Fhttppicardso

urceforgen

et

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

SNVandSV

calling

GAT

KVa

riant

calling

ofSN

Psandsm

allINDEL

scanalso

beused

onno

nhum

anandno

ndiploid

organism

shttpsw

wwbroadinstituteo

rggatk

[4ndash6

]

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

VCMM

Detectio

nof

SNVsa

ndIN

DEL

susin

gthem

ultin

omialprobabilistic

metho

din

WES

andWGSdata

httpem

usrcr

ikenjpVCM

M

[25]

FreeBa

yes

Detectio

nof

SNPsM

NPsINDEL

sandstr

ucturalvariants(SV

s)fro

msequ

encing

alignm

entsusingBa

yesia

nsta

tisticalmetho

ds

httpsgith

ubco

mekgfreebayes

[27]

indelM

INER

Splitread

algorithm

toidentifybreakp

oint

inIN

DEL

sfrom

paire

d-endsequ

encing

data

httpsgith

ubco

maakroshin

delM

INER

[32]

Pind

elDetectio

nof

INDEL

susin

gap

attern

grow

thapproach

with

anchor

pointsto

providen

ucleotide-levelresolution

httpgm

tgenom

ewustledupackagespindel

[30]

Platypus

Detectio

nof

SNPsM

NPsINDEL

sreplacem

ents

andstructural

varia

nts(SV

s)fro

msequ

encing

alignm

entsusinglocalrealignm

ent

andlocalassem

blyto

achieveh

ighspecificityandsensitivity

httpwwwwelloxacukplatypus

[26]

Splitread

Detectio

nof

INDEL

slessthan50

bplong

from

WES

orWGSdata

usingas

plit-read

algorithm

httpsplitreadso

urceforgen

et

[31]

Sprites

Detectio

nof

INDEL

sisd

oneu

singas

plit-read

andsoft-clipp

ing

approach

thatisespeciallysensitive

indatasetswith

lowcoverage

httpsgith

ubco

mzhang

zhensprites

[33]

VCFannotatio

n

ANNOVA

RProvides

up-to

-datea

nnotationof

VCFfiles

bygeneregion

and

filtersfro

mseveralother

databases

httpanno

varo

penb

ioinform

aticso

rg

[34]

MuT

ect

Postp

rocesses

varia

ntstoeliminatea

rtifa

ctsfrom

hybrid

capture

shortreadalignm

entandnext-generationsequ

encing

httpwwwbroadinstituteo

rgcancercgamutect

[35]

SnpE

ffUses3

800

0geno

mes

topredictand

anno

tatethee

ffectso

fvariants

ongenes

httpsnpeffsourceforgen

et

[36]

SnpSift

ToolstomanipulateV

CFfiles

inclu

ding

filterin

ganno

tatio

ncase

controls

transition

andtransversio

nratesa

ndmore

httpsnpeffsourceforgen

etSnp

Sifthtml

[37]

VAT

Ann

otationof

varia

ntsb

yfunctio

nalityin

acloud

compu

ting

environm

ent

httpvatgerste

inlaborg

[38]

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 5Ta

ble1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

YOABS

Usesa

0(119899)a

lgorith

mthatuses

both

hash

andtri-b

ased

metho

dsthat

aree

ffectiveinaligning

sequ

enceso

ver2

00bp

with

3tim

esless

mem

oryandtentim

esfaste

rthanSSAHA

Availableb

yrequ

estfor

noncom

mercialuse

[19]

HTS

eqPy

thon

basedpackagew

ithmanyfunctio

nsto

facilitates

everal

aspectso

fsequencingstu

dies

httpwww-hub

erem

bldeHTS

eqdocoverviewhtml

Auxiliary

tools

FastU

niq

Impo

rtssortsandidentifi

esPC

Rdu

plicates

ofshortsequences

from

sequ

encing

data

httpssou

rceforgenetprojectsfastu

niq

[23]

Picard

Picard

isas

etof

commandlin

etoo

lsform

anipulating

high

-throug

hput

sequ

encing

(HTS

)dataa

ndform

atssuchas

SAMBAMCRA

MandVC

Fhttppicardso

urceforgen

et

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

SNVandSV

calling

GAT

KVa

riant

calling

ofSN

Psandsm

allINDEL

scanalso

beused

onno

nhum

anandno

ndiploid

organism

shttpsw

wwbroadinstituteo

rggatk

[4ndash6

]

SAMtools

Suite

oftoolsc

apableof

view

ingindexing

editin

gwritingand

readingSA

MB

AMand

CRAM

form

attedfiles

httpwwwhtsliborg

[7]

VCMM

Detectio

nof

SNVsa

ndIN

DEL

susin

gthem

ultin

omialprobabilistic

metho

din

WES

andWGSdata

httpem

usrcr

ikenjpVCM

M

[25]

FreeBa

yes

Detectio

nof

SNPsM

NPsINDEL

sandstr

ucturalvariants(SV

s)fro

msequ

encing

alignm

entsusingBa

yesia

nsta

tisticalmetho

ds

httpsgith

ubco

mekgfreebayes

[27]

indelM

INER

Splitread

algorithm

toidentifybreakp

oint

inIN

DEL

sfrom

paire

d-endsequ

encing

data

httpsgith

ubco

maakroshin

delM

INER

[32]

Pind

elDetectio

nof

INDEL

susin

gap

attern

grow

thapproach

with

anchor

pointsto

providen

ucleotide-levelresolution

httpgm

tgenom

ewustledupackagespindel

[30]

Platypus

Detectio

nof

SNPsM

NPsINDEL

sreplacem

ents

andstructural

varia

nts(SV

s)fro

msequ

encing

alignm

entsusinglocalrealignm

ent

andlocalassem

blyto

achieveh

ighspecificityandsensitivity

httpwwwwelloxacukplatypus

[26]

Splitread

Detectio

nof

INDEL

slessthan50

bplong

from

WES

orWGSdata

usingas

plit-read

algorithm

httpsplitreadso

urceforgen

et

[31]

Sprites

Detectio

nof

INDEL

sisd

oneu

singas

plit-read

andsoft-clipp

ing

approach

thatisespeciallysensitive

indatasetswith

lowcoverage

httpsgith

ubco

mzhang

zhensprites

[33]

VCFannotatio

n

ANNOVA

RProvides

up-to

-datea

nnotationof

VCFfiles

bygeneregion

and

filtersfro

mseveralother

databases

httpanno

varo

penb

ioinform

aticso

rg

[34]

MuT

ect

Postp

rocesses

varia

ntstoeliminatea

rtifa

ctsfrom

hybrid

capture

shortreadalignm

entandnext-generationsequ

encing

httpwwwbroadinstituteo

rgcancercgamutect

[35]

SnpE

ffUses3

800

0geno

mes

topredictand

anno

tatethee

ffectso

fvariants

ongenes

httpsnpeffsourceforgen

et

[36]

SnpSift

ToolstomanipulateV

CFfiles

inclu

ding

filterin

ganno

tatio

ncase

controls

transition

andtransversio

nratesa

ndmore

httpsnpeffsourceforgen

etSnp

Sifthtml

[37]

VAT

Ann

otationof

varia

ntsb

yfunctio

nalityin

acloud

compu

ting

environm

ent

httpvatgerste

inlaborg

[38]

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

6 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Databasefi

ltration

1000

Genom

esProject

Genotypeinformationfro

map

opulationof

1000

healthyindividu

als

httpwww1000geno

mesorg

[41]

dbSN

PDatabaseo

fgenom

icvaria

ntsfrom

53organism

shttpsw

wwncbinlm

nihgovprojectsSN

P[39]

LOVD

Opensource

database

offre

elyavailableg

ene-centered

collectionof

DNAvaria

ntsa

ndsto

rage

ofpatie

ntandNGSdata

httpwwwlovdnl3

0hom

e[40]

COSM

ICDatabasec

ontainingsomaticmutations

from

human

cancers

separatedinto

expertcurateddataandgeno

me-wides

creenpu

blish

edin

scientificliterature

httpcancersa

ngerac

ukcosm

ic[42]

NHLB

IGOEx

ome

Sequ

encing

Project(ES

P)Databaseo

fgenes

andmechanism

sthatcon

tributetobloo

dlung

andheartd

isordersthrou

ghNGSdatain

vario

uspo

pulations

httpevsg

swashing

toneduEV

S

Exom

eAggregatio

nCon

sortium

(ExA

C)Databaseo

f60706un

related

individu

alsfrom

diseasea

ndpo

pulation

exom

esequencingstu

dies

httpexacbroadinstituteorg

[3]

SeattleSeqAnn

otation

Partof

theN

HBL

Isequencingprojectthisdatabase

contains

novel

andkn

ownSN

Vsa

ndIN

DEL

sincluding

accessionnu

mberfunctio

nof

thev

ariantand

HapMap

frequ

enciesclin

icalassociation

and

PolyPh

enpredictio

ns

httpsnpgswashing

toneduSeattleSeqA

nnotation137

Functio

nalpredictors

CADD

Machine

learning

algorithm

toscorea

llpo

ssible86million

substitutions

intheh

uman

referenceg

enom

efrom

1to99

basedon

know

nandsim

ulated

functio

nalvariants

httpcadd

gsw

ashing

toneduinfo

[49]

FATH

MM

UsesH

iddenMarkovMod

elstopredictthe

functio

nalcon

sequ

ences

ofSN

Vsincoding

andno

ncod

ingvaria

ntsthrou

ghaw

ebserver

httpfathmmbiocompu

teorguk

[46]

LRT

Usesthe

Likelih

oodRa

tiosta

tistic

altestto

compare

avariant

tokn

ownvaria

ntsa

nddeterm

ineiftheyarep

redicted

tobe

benign

deleterio

usoru

nkno

wn

httpgeno

mec

shlporgcon

tent19

915

53long

[45]

PolyPh

en-2

Predictspo

tentialimpactof

anon

syno

nymou

svariant

using

comparativ

eand

physicalcharacteris

tics

httpgenetic

sbwhharvardedupp

h2

[44]

SIFT

ByusingPS

I-BL

ASTa

predictio

ncanbe

madeo

nthee

ffectof

ano

nsyn

onym

ousm

utationwith

inap

rotein

httpsift

jcviorg

[43]

VES

TMachine

learning

approach

todeterm

inethe

prob

abilitythata

miss

ense

mutationwill

impairthefun

ctionalityof

aprotein

httpkarchinlaborgapp

sappV

esth

tml

[48]

MetaSVM

ampMetaLR

Integrationof

aSup

portVe

ctor

Machine

andLo

gisticRe

gressio

nto

integraten

ined

eleterious

predictio

nscores

ofmiss

ense

mutations

httpssitesg

ooglec

omsitejp

opgendb

NSFP

[47]

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 7

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Significant

somaticmutations

SomaticSn

iper

Usin

gtwobam

files

asinpu

tthistooluses

theg

enotypelikeliho

odmod

elof

MAZto

calculatethe

prob

abilitythatthetum

orandno

rmal

samples

ared

ifferentthus

identifying

somaticvaria

nts

httpgm

tgenom

ewustledupackagessom

atic-sniper

[50]

MuT

ect

Usin

gsta

tistic

alanalysisto

predictthe

likeliho

odof

asom

atic

mutationusingtwoBa

yesia

napproaches

httpsw

wwbroadinstituteo

rgcancercgamutect

[35]

VarSim

Byleveraging

onpreviouslyrepo

rted

mutationsa

rand

ommutation

simulationispreformed

topredictsom

aticmutations

httpbioinformgith

ubio

varsim

[51]

SomVa

rIUS

Identifi

catio

nof

somaticvaria

ntsfrom

unpaire

dtissues

amples

with

asequ

encing

depthof

150x

and67precision

implem

entedin

Python

httpsgith

ubco

mkylessm

ithSom

VarIUS

[52]

Copy

numbera

lteratio

n

Con

trol-F

REEC

Detectscopy

numberc

hanges

andlossof

heterozygosity(LOH)from

paire

dSA

MBAM

files

bycompu

tingandno

rmalizingcopy

number

andbetaallelefre

quency

httpbioinfo-ou

tcuriefrprojectsfre

ec

[59]

CNV-seq

Mappedread

coun

tisc

alculated

over

aslid

ingwindo

win

PerlandR

todeterm

inec

opynu

mberfrom

HTS

studies

httptig

erdbsnused

usgcnv-seq

[53]

SegSeq

Usin

g14

millionalignedsequ

ence

readsfrom

cancer

celllin

esequ

alcopy

numbera

lteratio

nsarec

alculatedfro

msequ

encing

data

httpsw

wwbroadinstituteo

rgcancercgasegseq

[54]

VarScan2

Determines

copy

numberc

hanges

inmatched

orun

matched

samples

usingread

ratio

sand

then

postp

rocessed

with

acirc

ular

binary

segm

entatio

nalgorithm

httpdk

oboldtgith

ubio

varscanusin

g-varscanhtml

[61]

Exom

eAI

Detectsalleleim

balanceincluding

LOHin

unmatched

tumor

samples

usingas

tatistic

alapproach

thatiscapableo

fhandlinglow-quality

datasets

httpgqinno

vatio

ncentercom

indexaspx

[64]

CNVseeqer

Exon

coverage

betweenmatched

sequ

encesw

ascalculated

usinglog 2

ratio

sfollowed

bythec

ircular

binary

segm

entatio

nalgorithm

httpicbmedco

rnelledu

wikiind

exphp

title=

Elem

entolab

CNVseeqerampredirect=n

o[60]

EXCA

VATO

RDetectscopy

numberv

ariantsfrom

WES

datain

3ste

psusinga

HiddenMarkovMod

elalgorithm

httpssou

rceforgenetprojectsexcavatortoo

l[57]

Exom

eCNV

Rpackageu

sedto

detectcopy

numberv

ariantso

flosso

fheterozygosityfro

mWES

data

httpssecureg

enom

eucla

eduindexph

pEx

omeC

NV

UserGuide

[58]

ADTE

xDetectio

nof

aberratio

nsin

tumor

exom

esby

detectingB-allele

frequ

encies

andim

plem

entedin

Rhttpadtexsourceforgen

et

[55]

CONTR

AUsesn

ormalized

depthof

coverage

todetectcopy

numberc

hanges

from

targeted

resequ

encing

datainclu

ding

WES

httpssou

rceforgenetprojectscontra-cnv

[56]

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

8 International Journal of Genomics

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

Driv

erpredictiontools

CHASM

Machine

learning

metho

dthatpredictsthefun

ctionalsignificance

ofsomaticmutations

httpkarchinlaborgapp

sappC

hasm

htm

l[65]

Dendrix

Den

ovodriversa

rediscovered

from

cancer

onlymutationald

ata

inclu

ding

genesnu

cleotidesord

omains

thathave

high

exclu

sivity

andcoverage

httpcompb

iocs

browneduprojectsdend

rix

[66]

MutSigC

VGene-specifica

ndpatie

nt-specific

mutationfre

quencies

are

incorporated

tofin

dmutations

ingenesthatare

mutated

moreo

ften

than

wou

ldbe

expected

bychance

httpwwwbroadinstituteo

rgcancersoftw

are

genepatte

rnm

odulesdocsMutSigC

V[67]

Pathwa

yana

lysistoolsa

ndresources

KEGG

Databaseu

singmapso

fkno

wnbiologicalprocessesthatallo

ws

searchingforg

enes

andcolorc

odingof

results

httpwwwgeno

mejpkegg

[68]

DAV

IDAllowsfor

userstoinpu

talarges

etof

genesa

nddiscover

the

functio

nalann

otationof

theg

enelist

inclu

ding

pathwaysgene

ontology

term

sandmore

httpsdavidncifcrfgov

[69]

STRING

Networkvisualizationof

protein-proteininteractions

ofover

2031

organism

shttpstr

ing-db

org

[70]

BERe

XUsesb

iomedicalkn

owledgetoallowuserstosearch

forrelationships

betweenbiom

edicalentities

httpinfosk

oreaac

krberex

[71]

DAPP

LEUsesa

listo

fgenes

todeterm

inep

hysic

alconn

ectiv

ityam

ongproteins

accordingto

protein-proteininteractions

httpjournalsplosorgplosgeneticsarticleid

=101371

journalpgen1001273

[72]

SNPsea

Usesa

linkage

disequ

ilibrium

todeterm

inep

athw

aysa

ndcelltypes

thatarelikely

tobe

affectedbasedon

SNPdata

httpwwwbroadinstituteo

rgm

pgsnp

sea

[73]

Toolsa

ndresourcesfor

linking

varia

ntstotherapeutics

cBioPo

rtal

Databasethatallo

wsthe

downloadanalysis

andvisualizationof

cancer

sequ

encing

studiesincluding

providingpatie

ntandclinical

dataforsam

ples

httpwwwcbiopo

rtalorg

[78]

MyCa

ncer

Genom

eDatabasefor

cancer

research

thatprovides

linkage

ofmutational

status

totherapiesa

ndavailablec

linicaltrials

httpsw

wwmycancergenom

eorg

httpwwwmycan-

cergenom

eorg

ClinVa

rDatabaseo

frelationshipbetweenph

enotypes

andhu

man

varia

tions

show

ingther

elationshipbetweenhealth

statusa

ndhu

man

varia

tions

andkn

ownim

plications

httpsw

wwncbinlm

nihgovclin

var

[74]

DSigD

BDatabaseo

fdrugsig

naturesthatincludes195

31genesa

nd17389

compo

unds

thatcanin

parthelpidentifycompo

unds

ford

rug

repu

rposingstu

dies

intransla

tionalresearch

httptanlabucdenveredu

DSigD

B[77]

PharmGKB

Know

ledgeb

asea

llowingvisualizationof

avarietyof

drug-gene

know

ledge

httpsw

wwph

armgkborg

[75]

DrugB

ank

Con

tainsd

etaileddrug

inform

ationwith

comprehensiv

edrugtarget

inform

ationfor8

206

drugs

httpwwwdrugbank

ca

[76]

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 9

Table1Con

tinued

Com

putatio

naltoo

lsDescriptio

nWebsite

References

WES

data

analysispipelin

es

fast2

VCF

Who

leEx

omeS

equencingpipelin

ethatstartsw

ithrawsequ

encing

(fastq

)files

andends

with

aVCF

filethath

asgood

capabilityforn

ovel

andexpertusers

httpfastq

2vcfso

urceforgen

et

[80]

SeqM

ule

WES

orWGSpipelin

ethatcom

binesthe

inform

ationfro

mover

ten

alignm

entand

analysistoolstoarriv

eata

VCFfilethatcan

beused

inbo

thMendelianandcancer

studies

httpseqm

uleo

penb

ioinform

aticso

rgenlatest

[79]

IMPA

CTWES

dataanalysispipelin

ethatstartsw

ithrawsequ

encing

readsa

ndanalyzes

SNVsa

ndCN

Asandlin

ksthisdatato

alist

ofprioritized

drug

sfrom

clinicaltria

lsandDSigD

Bhttptanlabucdenveredu

IMPA

CT

[81]

Genom

eson

theC

loud

(GotClou

d)

Automated

sequ

encing

pipelin

ethatp

erform

sinpartalignm

ent

varia

ntcalling

and

quality

controlthatcan

berunon

Amazon

Web

Services

EC2as

wellaslocalmachinesa

ndclu

sters

httpgeno

mes

phumicheduwikiG

otClou

d

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

10 International Journal of Genomics

structural variants with variants greater than 15 bp rarelybeing identified [31] Splitread anchors one end of a readand clusters the unanchored ends to identify size contentand location of structural variants [31] When comparedto GATK Splitread called 70 of the same INDELs butidentified 19 more unique INDELs 13 of which were verifiedby sanger sequencing [31] The unique ability of Splitread toidentify large structural variants and INDELs merits it beingused in conjunction with other INDEL detecting software inWES analysis

Recently developed indelMINER is a compilation of toolsthat takes the strengths of split-read and de novo assemblyto determine INDELs from paired-end reads of WGS data[32] Comparisons were done between SAMtools Pindel andindelMINER on a simulated dataset with 7500 INDELs [32]SAMtools found the least INDELs with 6491 followed byPindel with 7239 and indelMINER with 7365 INDELsidentified However indelMINERrsquos false-positive percentage(357) was higher than SAMtools (265) but lower thanPindel (453) Conversely indelMINER did have the lowestnumber of false-negatives with 398 compared to 589 and 1181for Pindel and SAMtools respectively Each of these tools hastheir own strengths and weaknesses as demonstrated by theauthors of indelMINER [32] Therefore it can be predictedthat future tools developed for SV detection will take anapproach similar to indelMINER in trying to incorporate thebest methods that have been developed thus far

Most of the recent SV detection tools rely on realigningsplit-reads for detecting deletions Instead of amore universalapproach like indelMINER Sprites [33] aims to solve theproblem of deletions with microhomologies and deletionswithmicroinsertions Sprites algorithm realigns soft-clippingreads to find the longest prefix or suffix that has amatch in thetarget sequence In terms of the 119865-score Sprites performedbetter than Pindel using real and simulated data [33]

All of these tools use different algorithms to address theproblem of structural variants which are common in humangenomes Each of these tools has strengths and weaknesses indetecting SVsTherefore it is suggested to use several of thesetools in combination to detect SVs in WES

25 VCFAnnotationMethods Once the variants are detectedand called the next step is to annotate these variantsThe twomost popular VCF annotation tools are ANNOVAR [34] andMuTect [35] which is part of the GATK pipeline ANNOVARwas developed in 2010 with the aim to rapidly annotatemillions of variants with ease and remains one of the popularvariant annotation methods to date [34] ANNOVAR canuse gene region or filter-based annotation to access over 20public databases for variants annotation MuTect is anothermethod that uses Bayesian classifiers for detecting andannotating variants [34 35] MuTect has been widely used incancer genomics research especially in The Cancer GenomeAtlas projects Other VCF annotation tools are SnpEff [36]and SnpSift [37] SnpEff can perform annotation for multiplevariants and SnpSift allows rapid detection of significantvariants from the VCF files [37] The Variant AnnotationTool (VAT) distinguishes itself from other annotation toolsin one aspect by adding cloud computing capabilities [38]

VAT annotation occurs at the transcript level to determinewhether all or only a subset of the transcript isoforms of agene is affected VAT is dynamic in that it also annotatesMultipleNucleotide Polymorphisms (MNPs) and can be usedon more than just the human species

26 Database and Resources for Variant Filtration Duringthe annotation process many resources and databases couldbe used as filtering criteria for detecting novel variants fromcommon polymorphisms These databases score a variant byits minor allelic frequency (MAF) within a specific popula-tion or study The need for filtration of variants based on thisnumber is subject to the purpose of the study For exampleMendelian studies would be interested in including commonSNVs while cancer studies usually focus on rare variantsfound in less than 1 of the population NCBI dbSNP data-base established in 2001 is an evolving database containingboth well-known and rare variants from many organisms[39] dbSNP also contains additional information includingdisease association genotype origin and somatic and germ-line variant information [39]

The Leiden Open Variation Database (LOVD) developedin 2005 links its database to several other repositories so thatthe user canmake comparisons and gain further information[40] One of the most popular SNV databases was developedin 2010 from the 1000 Genomes Project that uses statisticsfrom the sequencing of more than 1000 ldquohealthyrdquo people ofall ethnicities [41]This is especially helpful for cancer studiesas damaging mutations found in cancer are often very rare ina healthy population Another database essential for cancerstudies is the Catalogue of Somatic Mutations In Cancer(COSMIC) [42] This database of somatic mutations foundin cancer studies from almost 20000 publications allowsfor identification of potentially important cancer-related var-iants More recently the Exome Aggregation Consortium(ExAC) has assembled and reanalyzed WES data of 60706unrelated individuals from various disease-specific and pop-ulation genetic studies [3] The ExAC web portal and dataprovide a resource for assessing the significance of variantsdetected in WES data [3]

27 Functional Predictors of Mutation Besides knowing if aparticular variant has been previously identified researchersmay alsowant to determine the effect of a variantMany func-tional prediction tools have been developed that all varyslightly in their algorithms While individual predictionsoftware can be used ANNOVAR provides users with scoresfrom several different functional predictors including SIFTPolyPhen-2 LRT FATHMMMetaSVMMetaLR VEST andCADD [34]

SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based onclosely related sequence alignments [43] PolyPhen-2 uses apipeline involving eight sequence based methods and threestructure based methods in order to determine if a mutationis benign probably deleterious or known to be deleterious[44] The Likelihood Ratio Test (LRT) uses conservation be-tween closely related species to determine a mutations func-tional impact [45] When three genomes underwent analysis

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 11

by SIFT PolyPhen-2 and LRT only 5 of all predicteddeleterious mutations were agreed to be deleterious by allthree methods [45] Therefore it has been shown that usingmultiple mutational predictors is necessary for detecting awide range of deleterious SNVs FATHMMemploys sequenceconservation within Hidden Markov Models for predictingthe functional effects of protein missense mutation [46]FATHMMweighs mutations based on their pathogenicity bythe predicted interaction of the protein domain [46]

MetaSVM and MetaLR represent two ensemble meth-ods that combine 10 predictor scores (SIFT PolyPhen-2HDIV PolyPhen-2 HVAR GERP++ MutationTaster Muta-tion Assessor FATHMM LRT SiPhy and PhyloP) and themaximum frequency observed in the 1000 genomes popula-tions for predicting the deleterious variants [47] MetaSVMand MetaLR are based on the ensemble Support VectorMachine (SVM) and Logistic Regression (LR) respectivelyfor predicting the final variant scores [47]

The Variant Effect Scoring Tool (VEST) is similar toMetaSVM and MetaLR in that it uses a training set andmachine learning to predict functionality of mutations [48]Themain difference in the VEST approach is that the trainingset and prediction methodology are specifically designed forMendelian studies [48] The Combined Annotation Depen-dent Depletion (CADD) method differentiates itself by inte-grating multiple variants with mutations that have survivednatural selection as well as simulated mutations [49]

While all of these methods predict the functionality of amutation they all vary slightly in their methodological andbiological assumptions Dong et al have recently testedthe performance of these prediction algorithms on knowndatasets [47] They pointed out that these methods rarelyunanimously agree on if a mutation is deleterious Thereforeit is important to consider the methodology of the predictoras well as the focus of the study when interpreting deleteriousprediction results

3 Computational Methods forBeyond VCF Analyses

After a VCF file has been generated annotated and filteredthere are several types of analyses that can be performed(Figure 2) Here we outline six major types of analyses thatcan be performed after the generation of a VCF file withspecial focus on WES in cancer research (i) significantsomatic mutations (ii) pathway analysis (iii) copy numberestimation (iv) driver prediction (v) linking variants to clin-ical information and actionable therapies and (vi) emergingapplications of WES in cancer research

31 Methods to Determine Significant Somatic MutationsAfter VCF annotation a WES sample can have thousands ofSNVs identified however most of them will be silent (syn-onymous) mutations and will not be meaningful for follow-up study Therefore it is important to identify significantsomatic mutations from these variants Several tools havebeen developed to do this task for the analysis of cancerWESdata including SomaticSniper [50]MuTect [35] VarSim [51]and SomVarIUS [52]

SomaticSniper is a computational program that comparesthe normal and tumor samples to find out which mutationsare unique to the tumor sample hence predicted as somaticmutations [50] SomaticSniper uses the genotype likelihoodmodel of MAQ (as implemented in SAMtools) and thencalculates the probability that the tumor and normal geno-types are different The probability is reported as a somaticscore which is the Phred-scaled probability SomaticSniperhas been applied in various cancer research studies to detectsignificant somatic variants

Another popular somatic mutation identification tool isMuTect [35] developed by the Broad Institute MuTect likeSomaticSniper uses paired normal and cancer samples asinput for detecting somatic mutations After removing low-quality reads MuTect uses a variant detection statistic todetermine if a variant is more probable than a sequencingerror MuTect then searches for six types of known sequenc-ing artifacts and removes them A panel of normal samplesas well as the dbSNP database is used for comparison toremove common polymorphisms By doing this the numberof somatic mutations is not only identified but also reducedto a more probable set of candidate genes MuTect has beenwidely used in Broad Institute cancer genomics studies

While SomaticSniper andMuTect require data from bothpaired cancer and normal samples VarSim [51] and Som-VarIUS [52] do not require a normal sample to call somaticmutationsUnlikemost programs of its kindVarSim [51] usesa two-step process utilizing both simulation and experimen-tal data for assessing alignment and variant calling accuracyIn the first step VarSim simulates diploid genomes withgermline and somatic mutations based on a realistic modelthat includes SNVs and SVs In the second step VarSimperforms somatic variant detection using the simulated dataand validates the cancer mutations in the tumor VCF Som-VarIUS is another recent computational method to detectsomatic variants in cancer exomes without a normal pairedsample [52] In brief SomVarIUS consists of 3 steps forsomatic variant detection SomVarIUS first prioritizes poten-tial variant sites estimates the probability of a sequencingerror followed by the probability that an observed variantis germline or somatic In samples with greater than 150xcoverage SomVarIUS identifies somatic variants with at least677 precision and 646 recall rates when compared withpaired-tissue somatic variant calls in real tumor samples[52] Both VarSim and SomVarIUS will be useful for cancersamples that lack the corresponding normal samples forsomatic variant detection

32 Computational Tools for Estimating Copy Number Alter-ation One active research area in WES data analysis is thedevelopment of computational methods for estimating copynumber alterations (CNAs) Many tools have been developedfor estimatingCNAs fromWES data based on paired normal-tumor samples such as CNV-seq [53] SegSeq [54] ADTEx[55] CONTRA [56] EXCAVATOR [57] ExomeCNV [58]Control-FREEC (control-FREE Copy number caller) [59]and CNVseeqer [60] For example VarScan2 [61] is a com-putational tool that can estimate somatic mutations andCNAs from paired normal-tumor samples VarScan2 utilizes

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

12 International Journal of Genomics

a normal sample to find Somatic CNAs (SCNAs) by firstcomparing Q20 read depths between normal and tumorsamples and normalizes them based on the amount of inputdata for each sample [61] Copy number alteration is inferredfrom the log

2of the ratio of tumor depth to normal depth

for each region [61] Lastly the circular segmentation (CBS)algorithm [62] is utilized to merge adjacent segments to calla set of SCNAs These SCNAs could be further classifiedas large-scale (gt25 of chromosome arm) or focal (lt25)events in the WES data [63]

Recently ExomeAIwas developed to detect Allelic Imbal-ance (AI) from WES data [64] Utilizing heterozygous sitesExomeAI finds deviations from the expected 1 1 ratio be-tween an A- and B-allele inmultiple tumor samples without anormal comparison Absolute deviation of B-allele frequencyfrom 05 is calculated and similar to VarScan2 the CBSalgorithm is applied to each chromosomal arm [62] In orderto reduce the number of false positives a databasewas createdwith 500 (and counting) normal samples to filter out knownAIs This represents a novel tool to analyze WES for thedetection of recurrent AI events without matched normalsamples

A systematic evaluation of somatic copy number esti-mation tools for WES data has been recently published[63] In this study six computational tools for CNAs detec-tion (ADTEx CONTRA Control-FREEC EXCAVATORExomeCNV and VarScan2) were evaluated using WESdata from three TCGA datasets Using a SNP array asthe reference this study found that these algorithms gavehighly variable results The authors found that ADTEx andEXCAVATOR had the best performance with relatively highprecision and sensitivity when compared to the reference setThe study showed that the current CNA detection tools forWES data still have limitations and called for more robustalgorithms for this challenging task

33 Computational Tools for Predicting Drivers in CancerExomes Cancer is a disease driven by genetic variations andcopy number alterations These genetic events can be clas-sified into two classes ldquodriverrdquo and ldquopassengerrdquo mutationsDriver mutations are the key mutation that drive the devel-opment of cancer and provide a survival advantage whereaspassenger mutations are ldquoby-standerrdquo alterations that happento be altered in the primary cells but do not provide a survivaladvantage As the cancer exomes tend to have high muta-tional burdens identifying the ldquodriverrdquo mutations from theldquopassengerrdquo mutations is one of the key analyses in cancerresearch Several tools have been developed to find drivermutations including but not limited toCHASM[65] Dendrix[66] and MutSigCV [67]

CHASM (Cancer-specific High-throughput Annotationof Somatic Mutations) uses random forest as the machinelearning approach to distinguish the difference betweendriver and passenger mutations in cancer [65] CHASM wastrained on the curated driver mutations obtained from theCOSMIC database (ldquopositive examplesrdquo) and synthetic pas-senger mutations generated according to the background ofbase substitution frequencies observed in specific tumor

types (ldquonegative examplesrdquo) CHASM can achieve high sen-sitivity and specificity when discriminating between knowndriver missense mutations and randomly generated missensemutations when tested in real tumor samples This methodhas been one of the popular driver detection prediction toolsfor cancer researchers and has been applied in various cancergenomic studies

Another common driver mutation tool is MutSigCVdeveloped to resolve the problem of extensive false-positivefindings that overshadow true driver mutations [67] As thesize of cancer genomes sequenced has increased implausiblegenes (such as TTN) have been falsely reported as beingrelated to cancer when in fact their large size just makesthe probability they would be mutated by chance increase[67] MutSigCV takes into account patient-specific mutationfrequency and spectrum as well as gene-specific backgroundmutation rates expression level and replication time Bypooling all of this available data into one tool MutSigCV hasbecome a standard tool used for driver mutation identifica-tion in cancer studies

De novo Driver Exclusivity (Dendrix) is a novel com-putational tool to determine de novo driver pathways (genesets) from somatic mutations in patient data [66] The maingoal of the Dendrix algorithm is to find gene sets with highcoverage and high exclusivity properties from the somaticdataThe high coverage property assumes most patients haveat least one driver mutation in the gene set whereas the highexclusivity property assumes that these driver mutations arerarely mutated together in the same patient Two algorithmswere developed in Dendrix one based on a greedy algorithmand one based on the Markov Chain Monte Carlo (MCMC)algorithm to measure sets of genes that exhibit both criteriaWhenDendrix was applied to the TCGAdata the algorithmsidentified groups of genes that were mutated in large subsetsof patients and these mutations were mutually exclusiveThistool provides an opportunity to analyze WES data to identifydriver pathways in cancer genomic studies

34 Methods for Pathway Analysis After candidate somaticmutations have been identified one common type of analysisis to determine which pathways are affected by these muta-tions Common pathway resources and tools used for thesetypes of analysis include KEGG [68] DAVID [69] STRING[70] BEReX [71] DAPPLE [72] and SNPsea [73]

KEGG represents one of the most popular databasesfor pathway analysis DAVID is a popular online tool forperforming functional enrichment analysis based on userdefined gene lists STRING is the largest protein-proteininteractions database for querying and searching for inter-actions between user defined gene lists BEReX integratesSTRING KEGG and other data sources to explore biomedi-cal interactions between genes drugs pathways and diseasesBoth STRING and BEReX allow users to perform functionalenrichment analysis and the flexibility to explore the inter-actions between user defined gene lists by expanding thenetworks

DAPPLE (Disease Association Protein-Protein LinkEvaluator) uses literature reported protein-protein interac-tions to identify significant physical connectivity among the

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 13: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 13

genes of interest [72] DAPPLE hypothesizes that geneticvariation affects underlying mechanisms only detectable byprotein-protein interactions [72] SNPsea is another pathwayanalysis tool that requires specific SNP data [73] SNPseacalculates linkage disequilibriumbetween involved genes anduses a sampling approach to determine conditions that areaffected by these interactions

35 Computational Tools for Linking Variants to TreatmentsThe ability to link variants with actionable drug targets isan emerging research topic in precision medicine Databasessuch as My Cancer Genome have provided the frameworkfor these studies (httpswwwmycancergenomeorg) MyCancer Genome provides a bridge between genomic data andclinical therapeutic treatments Similarly ClinVar providesinformation on the relationship between variants and clinicaltherapy [74] By collecting both the variants and the clinicalsignificance related to these variants ClinVar offers a databasefor researchers to explore the significance of sequencing find-ings in the clinical setting [74] Pharmacological databasessuch as PharmGKB [75] DrugBank [76] and DSigDB [77]provide the link between drug and drug targets (variants)For example by querying a list of variants to one of thesedatabases it allows users to identify actionable targets viaenrichment analysis for the repurposing of drugs

Similarly the ability to incorporate clinical data intosequencing studies is vital to the advancement of person-alized medicine However due to the lack of integrationbetween electronic health records (EHR) andmolecular anal-ysis this remains one of the challenges in translating WESdata analysis into clinical practice Projects such as cBioPortalprovide a framework for incorporating sequencing data withavailable clinical data [78] New methods for addressing thistask are urgently needed to take advantage of the importantapplications ofWES data within the clinic in order to advanceprecision medicine

4 WES Analysis Pipelines

WES data analysis pipelines integrate computational toolsand methods described in the previous sections in a singleanalysis workflow Here we review three recent sequencingpipelines SeqMule [79] Fastq2vcf [80] and IMPACT [81] thatassimilate some of the tools described in previous sections

SeqMule stands out in part due to the use of five align-ment tools (BWA Bowtie 1 and 2 SOAP2 and SNAP) andfive different variant calling algorithms (GATK SAMtoolsVarScan2 FreeBayes and SOAPsnp) [79] SeqMule containsat least one feature that performs Pre-VCF analyses togenerate a filtered VCF file SeqMule also generates anaccompanying HTML-based report and images to showan overview of every step in the pipeline Fastq2vcf alsoperforms the Pre-VCF analyses using BWA as an alignmenttool and variant calling by GATK UnifiedGenotyper Haplo-typeCaller SAMtools and SNVer resulting in a filtered VCFafter implementation of ANNOVAR and VEP [80] Fastq2vcfcan be used in a single or parallel computing environment onvariety of sequencing data

Both SeqMule and Fastq2vcf pipelines focus on takingraw sequencing data and converting it into a filtered VCF fileIMPACT (Integrating Molecular Profiles with ACtionableTherapeutics) WES data analysis pipeline was developed totake this analysis a step further by linking a filtered VCF toactionable therapeutics [81] The IMPACT pipeline containsfour analytical modules detecting somatic variants callingcopy number alterations predicting drugs against the dele-terious variants and tumor heterogeneity analysis IMPACThas been applied to longitudinal samples obtained from amelanoma patient and identified novel acquired resistancemutations to treatment IMPACT analysis revealed loss ofCDKN2A as a novel resistance mechanism to the combina-tion of dabrafenib and trametinib treatment and predictedpotential drugs for further pharmacological and biologicalstudies [81]

To compare the strengths and weaknesses between thesethree WES pipelines SeqMule allows the use of differentalignment algorithms in its pipeline whereas IMPACT andFastq2vcf only utilize BWA as the sequencing alignmentalgorithm SAMtools is the common tool used by IMPACTFastq2vcf and SeqMule to call variants In addition Fastq2vcfand SeqMule employ GATK and other variant calling algo-rithms for variants detection Fastq2vcf and IMPACT bothannotate the variants withANNOVAR Fastq2vcf also utilizesVEP and IMPACT utilizes SIFT and PolyPhen-2 as theprimary variants prediction methods For Post-VCF analysisIMPACT pipeline has more options as compared to SeqMuleand Fastq2vcf In particular IMPACT performs copy numberanalysis tumor heterogeneity and linking of actionabletherapeutics to the molecular profiles However IMPACTis only designed to be performed on tumor samples whileSeqMule and Fastq2vcf are designed for any WES datasetTherefore it is advisable for the users to consider the analyticneeds to select the appropriateWES data analysis pipeline fortheir research

As recently discussed by Altman et al part of the USPrecision Medicine Initiative (PMI) includes being able todefine a gold standard of pipelines and tools for specificsequencing studies to enable a new era of medicine [82]Automated pipelines such as these will accelerate the analysisand interpretation of WES data Future development of dataanalysis pipeline will be needed to incorporate newer andwider tools tailored for specific research questions

5 Conclusions

In summary we have reviewed several computational toolsfor the analysis and interpretation of WES data Thesecomputationalmethods were developed to generate VCF filesfrom raw sequencing data as well as tools that performdownstream analyses in WES studies Each tool has specificstrengths and weaknesses and it appears that using severalof them in combination would lead to more accurate resultsCurrently there are still challenges for bioinformaticians atevery step in analyzing WES data However the greatestarea of need is in the development of tools that can linkthe information found in a VCF file to clinical databasesand therapeutics Research in this area will help to advance

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 14: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

14 International Journal of Genomics

precision medicine by providing user-friendly and informa-tive knowledge to transcend the laboratory

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank the Translational Bioin-formatics and Cancer Systems Biology Lab members fortheir constructive comments on this manuscript This workis partly supported by the National Institutes of HealthP50CA058187 Cancer League of Colorado the David F andMargaret T Grohne Family Foundation the Rifkin EndowedChair (WAR) and the Moore Family Foundation

References

[1] M LMetzker ldquoSequencing technologiesmdashthe next generationrdquoNature Reviews Genetics vol 11 no 1 pp 31ndash46 2010

[2] S Goodwin J D McPherson and W R McCombie ldquoComingof age ten years of next-generation sequencing technologiesrdquoNature Reviews Genetics vol 17 no 6 pp 333ndash351 2016

[3] M Lek K J Karczewski E V Minikel et al ldquoAnalysis of pro-tein-coding genetic variation in 60706 humansrdquo Nature vol536 no 7616 pp 285ndash291 2016

[4] A McKenna M Hanna E Banks et al ldquoThe genome analysistoolkit a MapReduce framework for analyzing next-generationDNA sequencing datardquo Genome Research vol 20 no 9 pp1297ndash1303 2010

[5] M A DePristo E Banks R Poplin et al ldquoA framework forvariation discovery and genotyping using next-generationDNAsequencing datardquo Nature Genetics vol 43 no 5 pp 491ndash4982011

[6] G A Van der M O Auwera C Hartl et al ldquoFrom FastQ datato high confidence variant calls the Genome Analysis Toolkitbest practices pipelinerdquo Current Protocols in Bioinformatics vol11 no 1110 pp 11101ndash111033 2013

[7] H Li B Handsaker A Wysoker et al ldquoThe Sequence Align-mentMap format and SAMtoolsrdquo Bioinformatics vol 25 no16 pp 2078ndash2079 2009

[8] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 Article ID btp698 pp 589ndash595 2010

[9] B Langmead C Trapnell M Pop and S L Salzberg ldquoUltrafastand memory-efficient alignment of short DNA sequences tothe human genomerdquo Genome Biology vol 10 no 3 article R252009

[10] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[11] SMarco-SolaM Sammeth RGuigo andP Ribeca ldquoTheGEMmapper fast accurate and versatile alignment by filtrationrdquoNature Methods vol 9 no 12 pp 1185ndash1188 2012

[12] T D Wu and S Nacu ldquoFast and SNP-tolerant detection ofcomplex variants and splicing in short readsrdquo Bioinformaticsvol 26 no 7 pp 873ndash881 2010

[13] H Li J Ruan and R Durbin ldquoMapping short DNA sequenc-ing reads and calling variants using mapping quality scoresrdquoGenome Research vol 18 no 11 pp 1851ndash1858 2008

[14] C Alkan J M Kidd T Marques-Bonet et al ldquoPersonalizedcopy number and segmental duplication maps using next-generation sequencingrdquo Nature Genetics vol 41 no 10 pp1061ndash1067 2009

[15] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[16] R Li C Yu Y Li et al ldquoSOAP2 an improved ultrafast tool forshort read alignmentrdquo Bioinformatics vol 25 no 15 pp 1966ndash1967 2009

[17] Z Ning A J Cox and J C Mullikin ldquoSSAHA a fast searchmethod for large DNA databasesrdquo Genome Research vol 11 no10 pp 1725ndash1729 2001

[18] G Lunter and M Goodson ldquoStampy a statistical algorithm forsensitive and fast mapping of Illumina sequence readsrdquoGenomeResearch vol 21 no 6 pp 936ndash939 2011

[19] V L Galinsky ldquoYOABS yet other aligner of biological sequen-cesmdashan efficient linearly scaling nucleotide alignerrdquo Bioinfor-matics vol 28 no 8 Article ID bts102 pp 1070ndash1077 2012

[20] H Li and N Homer ldquoA survey of sequence alignment algo-rithms for next-generation sequencingrdquo Briefings in Bioinfor-matics vol 11 no 5 Article ID bbq015 pp 473ndash483 2010

[21] J Shendure and H Ji ldquoNext-generation DNA sequencingrdquoNature Biotechnology vol 26 no 10 pp 1135ndash1145 2008

[22] T J Treangen and S L Salzberg ldquoRepetitive DNA and next-generation sequencing computational challenges and solu-tionsrdquo Nature Reviews Genetics vol 13 no 1 pp 36ndash46 2012

[23] H Xu X Luo J Qian et al ldquoFastUniq a fast de novo duplicatesremoval tool for paired short readsrdquo PLoS ONE vol 7 no 12Article ID e52249 2012

[24] S Pabinger A Dander M Fischer et al ldquoA survey of tools forvariant analysis of next-generation genome sequencing datardquoBriefings in Bioinformatics vol 15 no 2 pp 256ndash278 2014

[25] D Shigemizu A Fujimoto S Akiyama et al ldquoA practicalmethod to detect SNVs and indels from whole genome andexome sequencing datardquo Scientific Reports vol 3 article 21612013

[26] A Rimmer H Phan I Mathieson et al ldquoIntegratingmapping-assembly- and haplotype-based approaches for calling variantsin clinical sequencing applicationsrdquoNature Genetics vol 46 no8 pp 912ndash918 2014

[27] E Garrison and G Marth ldquoHaplotype-based variant detectionfrom short-read sequencingrdquo httpsarxivorgabs12073907

[28] G Ellison S Huang H Carr et al ldquoA reliable method for thedetection of BRCA1 and BRCA2 mutations in fixed tumourtissue utilising multiplex PCR-based targeted next generationsequencingrdquo BMC Clinical Pathology vol 15 no 1 article 52015

[29] A K Talukder S Ravishankar K Sasmal et al ldquoXomAnnotateanalysis of heterogeneous and complex exomemdasha step towardstranslational medicinerdquo PLoS ONE vol 10 no 4 Article IDe0123569 2015

[30] K Ye M H Schulz Q Long R Apweiler and Z Ning ldquoPindela pattern growth approach to detect break points of largedeletions and medium sized insertions from paired-end shortreadsrdquo Bioinformatics vol 25 no 21 pp 2865ndash2871 2009

[31] E Karakoc C Alkan B J OrsquoRoak et al ldquoDetection of structuralvariants and indels within exome datardquo Nature Methods vol 9no 2 pp 176ndash178 2012

[32] A Ratan T L Olson T P Loughran and W Miller ldquoIden-tification of indels in next-generation sequencing datardquo BMCBioinformatics vol 16 no 1 article 42 2015

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 15: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

International Journal of Genomics 15

[33] Z Zhang J Wang J Luo et al ldquoSprites detection of deletionsfrom sequencing data by re-aligning split readsrdquoBioinformaticsvol 32 no 12 pp 1788ndash1796 2016

[34] K Wang M Li and H Hakonarson ldquoANNOVAR functionalannotation of genetic variants from high-throughput sequenc-ing datardquo Nucleic Acids Research vol 38 no 16 article e1642010

[35] K Cibulskis M S Lawrence S L Carter et al ldquoSensitive detec-tion of somatic point mutations in impure and heterogeneouscancer samplesrdquoNature Biotechnology vol 31 no 3 pp 213ndash2192013

[36] P Cingolani A Platts L L Wang et al ldquoA program for anno-tating and predicting the effects of single nucleotide polymor-phisms SnpEff SNPs in the genome ofDrosophila melanogasterstrain w1118 iso-2 iso-3rdquo Fly vol 6 no 2 pp 80ndash92 2012

[37] P Cingolani V M Patel M Coon et al ldquoUsing Drosophilamelanogaster as a model for genotoxic chemical mutationalstudies with a new program SnpSiftrdquo Frontiers in Genetics vol3 article 35 2012

[38] L Habegger S Balasubramanian D Z Chen et al ldquoVAT acomputational framework to functionally annotate variants inpersonal genomes within a cloud-computing environmentrdquoBioinformatics vol 28 no 17 pp 2267ndash2269 2012

[39] S T SherryM-HWardMKholodov et al ldquoDbSNP theNCBIdatabase of genetic variationrdquo Nucleic Acids Research vol 29no 1 pp 308ndash311 2001

[40] I F A C Fokkema J T den Dunnen and P E M TaschnerldquoLOVD easy creation of a locus-specific sequence variationdatabase using an lsquoLSDB-in-a-boxrsquo approachrdquo Human Muta-tion vol 26 no 2 pp 63ndash68 2005

[41] 1000Genomes Project ConsortiumG R Abecasis D Altshuleret al ldquoAmapof human genome variation frompopulation-scalesequencingrdquo Nature vol 467 no 7319 pp 1061ndash1073 2010

[42] S A Forbes D Beare P Gunasekaran et al ldquoCOSMIC explor-ing the worldrsquos knowledge of somatic mutations in humancancerrdquo Nucleic Acids Research vol 43 no 1 pp D805ndashD8112015

[43] P Kumar S Henikoff and P C Ng ldquoPredicting the effects ofcoding non-synonymous variants on protein function using theSIFT algorithmrdquo Nature Protocols vol 4 no 7 pp 1073ndash10812009

[44] I A Adzhubei S Schmidt L Peshkin et al ldquoA method andserver for predicting damaging missense mutationsrdquo NatureMethods vol 7 no 4 pp 248ndash249 2010

[45] S Chun and J C Fay ldquoIdentification of deleterious mutationswithin three human genomesrdquo Genome Research vol 19 no 9pp 1553ndash1561 2009

[46] HA Shihab J GoughDNCooper et al ldquoPredicting the func-tional molecular and phenotypic consequences of amino acidsubstitutions using hidden Markov modelsrdquo Human Mutationvol 34 no 1 pp 57ndash65 2013

[47] C Dong P Wei X Jian et al ldquoComparison and integrationof deleteriousness prediction methods for nonsynonymousSNVs in whole exome sequencing studiesrdquo Human MolecularGenetics vol 24 no 8 pp 2125ndash2137 2015

[48] H Carter C Douville P D Stenson D N Cooper and RKarchin ldquoIdentifying Mendelian disease genes with the varianteffect scoring toolrdquo BMC Genomics vol 14 p S3 2013

[49] M Kircher D M Witten P Jain B J OrsquoRoak G M Cooperand J Shendure ldquoA general framework for estimating the rela-tive pathogenicity of human genetic variantsrdquo Nature Geneticsvol 46 no 3 pp 310ndash315 2014

[50] D E Larson C C Harris K Chen et al ldquoSomaticsniperidentification of somatic point mutations in whole genomesequencing datardquo Bioinformatics vol 28 no 3 Article IDbtr665 pp 311ndash317 2012

[51] J C Mu M Mohiyuddin J Li et al ldquoVarSim a high-fidelitysimulation and validation framework for high-throughputgenome sequencing with cancer applicationsrdquo Bioinformaticsvol 31 no 9 pp 1469ndash1471 2014

[52] K S Smith V K Yadav S Pei D A Pollyea C T Jordan and SDe ldquoSomVarIUS somatic variant identification from unpairedtissue samplesrdquo Bioinformatics vol 32 no 6 pp 808ndash813 2015

[53] C Xie and M T Tammi ldquoCNV-seq a new method to detectcopy number variation using high-throughput sequencingrdquoBMC Bioinformatics vol 10 no 1 article 80 2009

[54] D Y Chiang G Getz D B Jaffe et al ldquoHigh-resolution map-ping of copy-number alterationswithmassively parallel sequen-cingrdquo Nature Methods vol 6 no 1 pp 99ndash103 2009

[55] K C Amarasinghe J Li S M Hunter et al ldquoInferring copynumber and genotype in tumour exome datardquo BMC Genomicsvol 15 no 1 article 732 2014

[56] J Li R Lupat K C Amarasinghe et al ldquoCONTRA copynumber analysis for targeted resequencingrdquo Bioinformatics vol28 no 10 Article ID bts146 pp 1307ndash1313 2012

[57] A Magi L Tattini I Cifola et al ldquoEXCAVATOR detectingcopy number variants from whole-exome sequencing datardquoGenome Biology vol 14 no 10 article R120 2013

[58] J F Sathirapongsasuti H Lee B A J Horst et al ldquoExomesequencing-based copy-number variation and loss of heterozy-gosity detection ExomeCNVrdquo Bioinformatics vol 27 no 19Article ID btr462 pp 2648ndash2654 2011

[59] V Boeva A Zinovyev K Bleakley et al ldquoControl-free callingof copy number alterations in deep-sequencing data using GC-content normalizationrdquo Bioinformatics vol 27 no 2 pp 268ndash269 2011

[60] Y Jiang D Redmond K Nie et al ldquoDeep sequencing revealsclonal evolution patterns and mutation events associated withrelapse in B-cell lymphomasrdquo Genome Biology vol 15 no 8article 432 2014

[61] D C Koboldt Q Zhang D E Larson et al ldquoVarScan 2 somaticmutation and copy number alteration discovery in cancer byexome sequencingrdquo Genome Research vol 22 no 3 pp 568ndash576 2012

[62] A B Olshen H Bengtsson P Neuvial P T Spellman R AOlshen and V E Seshan ldquoParent-specific copy number inpaired tumor-normal studies using circular binary segmenta-tionrdquo Bioinformatics vol 27 no 15 Article ID btr329 pp 2038ndash2046 2011

[63] J-Y Nam N K Kim S C Kim et al ldquoEvaluation of somaticcopy number estimation tools for whole-exome sequencingdatardquo Briefings in Bioinformatics vol 17 no 2 pp 185ndash192 2015

[64] J Nadaf J Majewski and S Fahiminiya ldquoExomeAI detectionof recurrent allelic imbalance in tumors using whole-exomesequencing datardquo Bioinformatics vol 31 no 3 pp 429ndash4312014

[65] H Carter J Samayoa R H Hruban and R Karchin ldquoPrioriti-zation of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations(CHASM)rdquo Cancer Biology amp Therapy vol 10 no 6 pp 582ndash587 2010

[66] F Vandin E Upfal and B J Raphael ldquoDe novo discovery ofmutated driver pathways in cancerrdquo Genome Research vol 22no 2 pp 375ndash385 2012

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 16: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

16 International Journal of Genomics

[67] M S Lawrence P Stojanov P Polak et al ldquoMutational het-erogeneity in cancer and the search for new cancer-associatedgenesrdquo Nature vol 499 no 7457 pp 214ndash218 2013

[68] M Kanehisa and S Goto ldquoKEGG kyoto encyclopedia of genesand genomesrdquo Nucleic Acids Research vol 28 no 1 pp 27ndash302000

[69] D W Huang B T Sherman and R A Lempicki ldquoBioin-formatics enrichment tools paths toward the comprehensivefunctional analysis of large gene listsrdquo Nucleic Acids Researchvol 37 no 1 pp 1ndash13 2009

[70] D Szklarczyk A Franceschini S Wyder et al ldquoSTRING v10protein-protein interaction networks integrated over the treeof liferdquo Nucleic Acids Research vol 43 no 1 pp D447ndashD4522015

[71] M Jeon S Lee K Lee A-C Tan and J Kang ldquoBEReX biomed-ical entity-relationship eXplorerrdquo Bioinformatics vol 30 no 1pp 135ndash136 2014

[72] E J Rossin K Lage S Raychaudhuri et al ldquoProteins encodedin genomic regions associated with immune-mediated diseasephysically interact and suggest underlying biologyrdquo PLoSGenet-ics vol 7 no 1 Article ID e1001273 2011

[73] K Slowikowski X Hu and S Raychaudhuri ldquoSNPsea analgorithm to identify cell types tissues and pathways affectedby risk locirdquo Bioinformatics vol 30 no 17 pp 2496ndash2497 2014

[74] M J Landrum J M Lee G R Riley et al ldquoClinVar publicarchive of relationships among sequence variation and humanphenotyperdquo Nucleic Acids Research vol 42 no 1 pp D980ndashD985 2014

[75] M Whirl-Carrillo E M McDonagh J M Hebert et alldquoPharmacogenomics knowledge for personalized medicinerdquoClinical Pharmacology and Therapeutics vol 92 no 4 pp 414ndash417 2012

[76] V Law C Knox Y Djoumbou et al ldquoDrugBank 40 sheddingnew light on drug metabolismrdquo Nucleic Acids Research vol 42no 1 pp D1091ndashD1097 2014

[77] M Yoo J Shin J Kim et al ldquoDSigDB drug signatures databasefor gene set analysisrdquo Bioinformatics vol 31 no 18 pp 3069ndash3071 2014

[78] E Cerami J GaoUDogrusoz et al ldquoThe cBio cancer genomicsportal an open platform for exploringmultidimensional cancergenomics datardquo Cancer Discovery vol 2 no 5 pp 401ndash4042012

[79] Y Guo X Ding Y Shen G J Lyon and K Wang ldquoSeq-Mule automated pipeline for analysis of human exomegenomesequencing datardquo Scientific Reports vol 5 article 14283 2015

[80] X Gao J Xu and J Starmer ldquoFastq2vcf a concise and transpar-ent pipeline for whole-exome sequencing data analysesrdquo BMCResearch Notes vol 8 no 1 p 72 2015

[81] J Hintzsche J Kim V Yadav et al ldquoIMPACT a whole-exomesequencing analysis pipeline for integrating molecular profileswith actionable therapeutics in clinical samplesrdquo Journal of theAmericanMedical Informatics Association vol 23 no 4 pp 721ndash730 2016

[82] R B Altman S Prabhu A Sidow et al ldquoA research roadmap fornext-generation sequencing informaticsrdquo Science TranslationalMedicine vol 8 no 335 Article ID 335ps10 2016

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 17: Review Article A Survey of Computational Tools to Analyze ...downloads.hindawi.com/journals/ijg/2016/7983236.pdf · and management [, ] . Whole Exome Sequencing (WES) is the application

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology