Finding and Calling Genome Variants
Outline
• Genome variants overview
• Mining variants from databases
  - dbSNP
  - HapMap
  - 1000 Genomes
  - Disease/Clinical variants databases
• Calling variants using your own data
  - GATK best practices
  - Samtools (mpileup/bcftools)
Genomic Variation
• Population genetics
  - Measure/explain diversity/heritability
• Disease susceptibility
  - GWAS
  - Biomarkers
• Variants may cause a particular trait
  - Regulatory element (eg. promoter, enhancer, 3’UTR etc.)
  - Protein coding sequence (eg. silent, missense, or nonsense mutation)
Palstra, RJ. et al (2012); http://evolution.berkeley.edu/evolibrary/article/mutations_06
Genomic Variation: Sequence vs Structural Variation
http://www.ensembl.org/info/genome/variation
• Sequence Variants
Type          Description                                  Example (Reference / Alternative)
SNP           Single Nucleotide Polymorphism               Ref: ...TTGACGTA...  Alt: ...TTGGCGTA...
Insertion     Insertion of one or several nucleotides      Ref: ...TTGACGTA...  Alt: ...TTGATGCGTA...
Deletion      Deletion of one or several nucleotides       Ref: ...TTGACGTA...  Alt: ...TTGGTA...
Substitution  A sequence alteration where the length of the change in the variant is the same as that of the reference.
                                                           Ref: ...TTGACGTA...  Alt: ...TTGTAGTA...
• Structural Variants (more than 50 bases)
Type           Description                                                                        Example (Reference / Alternative)
CNV            Copy Number Variation: increases or decreases the copy number of a given region    "Gain" of one copy; "Loss" of one copy
Inversion      A continuous nucleotide sequence is inverted in the same position
Translocation  A region of nucleotide sequence that has translocated to a new position (eg. BCR-ABL fusion gene)
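The sequence-variant types in the table can be told apart from the reference and alternative alleles alone. A minimal Python sketch, assuming the alleles are passed as plain strings (with a shared anchor base for insertions and deletions, as in VCF); the function name `classify_variant` is illustrative and not taken from any tool:

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a sequence variant by comparing allele lengths.

    Simplification: mixed events (e.g. a deletion plus an insertion of a
    different length) are labelled by their net length change only.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "substitution"
    if len(alt) > len(ref):
        return "insertion"
    return "deletion"


# Alleles derived from the table's examples (context ...TTGACGTA...).
for ref, alt in [("A", "G"), ("A", "ATG"), ("GAC", "G"), ("AC", "TA")]:
    print(ref, alt, classify_variant(ref, alt))
```

Structural variants are not covered here; they are usually detected from read-pair and read-depth signals rather than from allele strings.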
Genome Variation: Individual and Population
• Single Nucleotide Polymorphisms (SNP)
  - MAF* > 1%: common SNP
  - MAF* < 1%: rare SNP
  - Some definitions use 5% as threshold
• On average one variant every 1200 bases (based on HapMap)
*Minor Allele Frequency
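The common/rare split is simple arithmetic on allele counts. A short Python sketch for a biallelic site; the function names are illustrative, and treating a MAF of exactly the threshold as "rare" is an assumption, since the slide only gives ">" and "<":

```python
def minor_allele_frequency(ref_count: int, alt_count: int) -> float:
    """MAF at a biallelic site: frequency of the less common allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no alleles observed")
    return min(ref_count, alt_count) / total


def classify_snp(maf: float, threshold: float = 0.01) -> str:
    """Label a SNP as common or rare (default threshold 1%)."""
    # Assumption: a MAF exactly equal to the threshold is called rare.
    return "common" if maf > threshold else "rare"


# Hypothetical counts: 100 alternative alleles among 2000 chromosomes.
maf = minor_allele_frequency(ref_count=1900, alt_count=100)
print(maf, classify_snp(maf), classify_snp(maf, threshold=0.05))
```

Passing `threshold=0.05` reproduces the stricter definition mentioned in the last sub-bullet.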
Genome Variation: Reference
Organism     Description/Strain    Assembly*
Human        DNA isolated from WBC of 4 anonymous individuals (2 males and 2 females). However, the majority of the sequence came from one of the male donors
                                   GRCh37/GRCh38
Mouse        C57BL/6J              GRCm37/GRCm38
C. elegans   N2                    WormBase v. WS220
Fruit fly    ISO1                  BDGP Release 5
Yeast        S288C                 SGD Feb 2011
A. thaliana  Col ecotype           TAIR10
*Available in /nfs/genomes
Describing/Annotating Variants
• General guidelines*
  - no position 0
  - range indicated by “_” (eg. 586_591)
• DNA
  - g.957A>T (to include chromosome use chr9:g.957A>T)
  - g.413delG
  - g.451_452insT
  - In CDS,
    - c.23G>C
    - +1 is A of ATG (start codon); -1 is the previous/upstream nucleotide
    - “*” marks positions 3’ of the stop codon (eg. *1 is the first nucleotide after the stop codon)
• RNA
  - r.957a>u
• Protein (three/one letter aa)
  - p.His78Gln
*For complete list/guidelines see hgvs.org
Example record:
Chr  Position   Ref  Alt  Source  g.change:rsID:Depth=AvgSampleReadDepth:FunctionGVS:hgvsProteinVariant
16   89824989   G    T    EVS     g.89824989G>T:rs140823801:Depth=141:missense:p.Q993K
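Descriptions in this notation are regular enough to parse mechanically. A minimal Python sketch that handles only simple substitutions such as g.957A>T; the pattern and the function name `parse_substitution` are illustrative assumptions and cover a small fraction of the full HGVS recommendations:

```python
import re

# Optional "chrN:" prefix, a sequence type (g, c or r), a position,
# then REF>ALT. Deletions, insertions and ranges are deliberately not handled.
SUBSTITUTION = re.compile(
    r"^(?:(?P<chrom>[^:]+):)?"
    r"(?P<kind>[gcr])\."
    r"(?P<pos>\d+)"
    r"(?P<ref>[ACGTacgu])>(?P<alt>[ACGTacgu])$"
)


def parse_substitution(description: str) -> dict:
    """Split a simple HGVS-style substitution into its parts."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple substitution: {description!r}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    return fields


print(parse_substitution("chr9:g.957A>T"))
print(parse_substitution("r.957a>u"))
```

Anything outside the pattern, for example g.413delG, raises `ValueError` rather than being guessed at.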
Genome Variation Databases: dbSNP
• Repository for SNPs and short sequence variation (<50 bases)
• Current build: dbSNP 150 (Feb 2017)
  - Approx. 135M validated rs#’s for human
    - Mostly germline mutations (smaller subset of somatic)
    - Contains rare variants as well
  - Various organisms (Support for non-human organisms ending Sept 1st.)
• Each SNP, or record, is identified by an rs# that includes
  - Summary attributes
  - NCBI resources (linked to ClinVar, GenBank, etc.)
  - External resources (linked to OMIM and NHGRI GWAS)
• Submissions are made from public laboratories and private organizations (ss#’s), and identical records are clustered into a single record (rs#’s).
• rsID is the same for different assemblies (eg. GRCh37/38), but chromosomal coordinates may differ!
Hands-on: dbSNP
• Mining variants from databases
• Finding SNPs for your favorite gene in dbSNP
Genome Variation Databases: 1000 Genomes Project
• Extension of the HapMap in 2008 to catalogue genetic variation by sequencing at least 1000 participants
• Discover population level human genetic variations
• Initially consisted of whole genome low coverage (4X) and high coverage exome (20X) sequencing
• VCF format was developed, and initially maintained, for the project
• Phase 3 contains WGS data for 2504 individuals across 26 populations.
http://www.internationalgenome.org/
Mining Disease/Clinical Variants
Database                            Link
Catalog of Published GWAS (NHGRI)   https://www.ebi.ac.uk/gwas/
GWAS Central                        gwascentral.org
ClinVar (NCBI)                      ncbi.nlm.nih.gov/clinvar
PheGenI (NCBI)                      ncbi.nlm.nih.gov/gap/phegeni
SNPedia                             snpedia.com
Mining Disease/Clinical Variants in Cancer: COSMIC
• http://cancer.sanger.ac.uk/cosmic
• Catalog of Somatic Mutations in Cancer (COSMIC) created in 2005
• v70 (Aug 2014) had ~2M coding point mutations
• Datasets are curated from published literature and other databases (eg. TCGA, ICGC)
• Available in both GRCh37/38 coordinates
• Tools/Features
  - Cancer Gene Census (currently 572 genes)
  - Browser: Cancer/Cell Line
  - COSMICMart (similar to BioMart)
Calling variants from sequence data requires 3 broad steps
1) Prepare data: QC, align, SAM -> BAM, sort, remove PCR duplicates
2) Call variants: base quality score recalibration, variant call, quality filtering
3) Annotate for function: snpEff, HaploReg, GTEx
Workflow: 1) Prepare data, 2) Call variants, 3) Annotate for function
Step 1, Prepare data: QC reads & align to reference
Check read quality with fastqc
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Align reads to reference genome
• Use a sensitive (gapped) aligner to account for large indels (BWA, http://bio-bwa.sourceforge.net/)*.
*See BaRC SOPs for usage instructions.
Convert SAM -> BAM and sort reads by coordinates (https://broadinstitute.github.io/picard/)
• Picard Tools: AddOrReplaceReadGroups
  - SO=coordinate <- sorts mapped reads by coordinate.
• Picard Tools: MarkDuplicates
  - This command flags all duplicate reads in the file.
  - This flag is recognized by samtools mpileup and GATK HaplotypeCaller.
  - By default, reads with this flag will be ignored.
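The duplicate mark lives in the FLAG column of each alignment record: in the SAM specification, bit 0x400 (decimal 1024) means "PCR or optical duplicate". A small Python sketch of how a downstream tool might skip such reads; the records are hypothetical and the helper names are illustrative, not part of Picard or samtools:

```python
# SAM FLAG bit 0x400 (decimal 1024) marks a PCR or optical duplicate.
DUPLICATE_FLAG = 0x400


def is_duplicate(sam_line: str) -> bool:
    """Return True if the alignment record has the duplicate bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second SAM column
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Keep header lines (starting with '@') and non-duplicate records."""
    return [
        line for line in sam_lines
        if line.startswith("@") or not is_duplicate(line)
    ]


# Hypothetical records: read2 has the same flags as read1 plus 0x400.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "read2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
]
kept = drop_duplicates(example)
print(len(kept), "of", len(example), "lines kept")
```

In practice the reads stay in the BAM file; callers simply test this bit and ignore flagged reads.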
Samtools mpileup
• The mpileup command scans every position supported by an aligned read and records the possible genotypes.
• Moreover, every time a mapped read has a mismatch to the reference genome, it incorporates information, such as the number of reads that share the mismatch, the quality of the base at that position, and the expected sequencing error rates.
• It then computes the probability that each of these genotypes is truly present in the sample.
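To make the idea concrete, here is a textbook-style Python sketch of diploid genotype likelihoods at a single site. It is a simplified model under stated assumptions (independent reads, errors spread evenly over the three other bases, a flat prior) and is not the model samtools actually implements; all names are illustrative:

```python
import math


def genotype_likelihoods(bases, quals, ref, alt):
    """Return log10 likelihoods of ref/ref, ref/alt and alt/alt.

    bases: observed base from each read covering the site
    quals: Phred quality of each observed base
    """
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    log_like = {}
    for name, (allele1, allele2) in genotypes.items():
        total = 0.0
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # Phred score -> error probability
            p1 = 1 - err if base == allele1 else err / 3
            p2 = 1 - err if base == allele2 else err / 3
            # A read samples either chromosome with probability 1/2.
            total += math.log10(0.5 * p1 + 0.5 * p2)
        log_like[name] = total
    return log_like


def normalize(log_like):
    """Convert log10 likelihoods to probabilities, assuming a flat prior."""
    top = max(log_like.values())
    weights = {g: 10 ** (v - top) for g, v in log_like.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical pileup: three reads support A (reference), three support G.
posterior = normalize(genotype_likelihoods("AAAGGG", [30] * 6, ref="A", alt="G"))
print(max(posterior, key=posterior.get))
```

With an even split of high-quality bases the heterozygous genotype dominates; with all reads showing the reference base, ref/ref does.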
Base quality score recalibration* (BQSR)
• Quality scores produced by sequencers are subject to systematic technical error, which may lead to over- or under-estimated base quality scores.
• BQSR is a process that applies machine learning to model these errors empirically and adjust the quality scores accordingly.
• For example, for a given run, when two A nucleotides in a row are called, the next base called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%.
*https://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr
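Quality scores are on the Phred scale, Q = -10 log10(p), so "adjusting a score" means replacing the reported error probability with one measured from the data. A minimal Python sketch of that arithmetic for a single group of bases (for example, all bases that follow AA); the add-one smoothing and the counts are assumptions of this sketch, and GATK's actual model is more elaborate:

```python
import math


def phred_to_error(quality: float) -> float:
    """Phred quality -> probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def error_to_phred(error_rate: float) -> float:
    """Probability of a wrong base call -> Phred quality."""
    return -10 * math.log10(error_rate)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Quality implied by the mismatches actually observed in one group.

    Assumption: +1/+2 smoothing keeps the estimate finite when a group
    contains no mismatches at all.
    """
    return error_to_phred((mismatches + 1) / (observations + 2))


# Hypothetical group: the sequencer reported Q30, but 99 of 9998 bases
# disagreed with the reference at sites not known to be variant.
reported = 30
recalibrated = empirical_quality(mismatches=99, observations=9998)
print(f"reported Q{reported}, empirical Q{recalibrated:.1f}")
```

Mismatches at known variant sites are excluded before counting, which is why BQSR needs a database of known variants.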
Calling Variants
Calling Variants: Questionable Calls
bcftools call
• The bcftools call command uses the genotype likelihoods generated from samtools mpileup to call variants, and outputs all identified variants in variant call (VCF) format.
GATK HaplotypeCaller
• When HaplotypeCaller encounters a read-mapped region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region.
• This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other.
• For each region, it performs a pairwise alignment of each read against each haplotype. This produces a matrix of likelihoods of haplotypes. The most likely allele for each position is assigned.
• HaplotypeCaller is able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers.
VCF Format
www.1000genomes.org
• Variant Call Format (VCF); BCF: binary version of VCF
• Text file format with meta-information and header lines, followed by data lines containing information about a position in the genome.
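The three kinds of lines are easy to tell apart: meta-information lines start with "##", the single header line starts with "#", and everything else is a tab-separated data line. A minimal Python sketch of a reader for the eight fixed columns; the example record reuses the position, alleles, rsID and depth from the earlier annotation slide, while its QUAL and FILTER values are invented for illustration:

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(text: str):
    """Split VCF text into meta lines, header columns and data records."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            header = line[1:].split("\t")
        else:
            # Fall back to the fixed columns if no header line was seen.
            columns = header or FIXED_COLUMNS
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])
            records.append(record)
    return meta, header, records


example = "\n".join([
    "##fileformat=VCFv4.2",
    "#" + "\t".join(FIXED_COLUMNS),
    "\t".join(["16", "89824989", "rs140823801", "G", "T", "50", "PASS", "DP=141"]),
])
meta, header, records = parse_vcf(example)
print(records[0]["CHROM"], records[0]["POS"], records[0]["REF"], records[0]["ALT"])
```

Real files also carry FORMAT and per-sample genotype columns after INFO; this sketch keeps whatever columns the header names.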
Vcftools vcf-annotate
• Vcftools vcf-annotate is a way to hard filter your called variants using “standard” quality thresholds or through user-specified thresholds.
  - vcf-annotate -f + myFile.vcf > myFile_annot.vcf
  - “+” applies several filters with default values, eg.
    - Qual INT Minimum value of the QUAL field [10]
    - MinDP INT Minimum read depth [2]
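A hard filter is nothing more than a set of fixed cut-offs applied to each record. The Python sketch below mimics the two defaults listed above (QUAL at least 10, depth at least 2); it is an illustration of the idea, not vcf-annotate's implementation, and it assumes depth is stored as `DP=` in the INFO column and that QUAL is numeric:

```python
def info_value(info: str, key: str):
    """Return the value of KEY=value in a VCF INFO string, or None."""
    for item in info.split(";"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return None


def hard_filter(record: dict, min_qual: float = 10.0, min_depth: int = 2) -> str:
    """Return 'PASS' or a ';'-joined list of the filters that failed."""
    failed = []
    if float(record["QUAL"]) < min_qual:
        failed.append("Qual")
    depth = info_value(record["INFO"], "DP")
    if depth is not None and int(depth) < min_depth:
        failed.append("MinDP")
    return ";".join(failed) if failed else "PASS"


# Hypothetical records with only the fields this sketch needs.
calls = [
    {"QUAL": "50", "INFO": "DP=20"},
    {"QUAL": "5", "INFO": "DP=1"},
    {"QUAL": "30", "INFO": "DP=1;AF=0.5"},
]
for call in calls:
    print(call["QUAL"], call["INFO"], hard_filter(call))
```

Failing records are labelled in the FILTER column rather than deleted, so the thresholds can be revisited later.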
GATK Variant quality score recalibrator (VQSR)
• VQSR assigns a well-calibrated probability to each variant call in a callset, which can be used to filter for high quality variants.
• VQSR achieves this by taking a reference set it assumes to be “true” variants (HapMap) and building a distribution of their quality metrics. This is used to build a model of what a “true” variant should look like.
• This model then assigns a recalibration quality score to your variants. The higher this score, the greater its fit to the “true” model.
• The tool allows for the setting of “Tranches”, or thresholds that theoretically allow you to recover 100%, 99%, 90%, etc. of the true variants in the training set. You can filter your results on this metric to achieve greater/reduced specificity/sensitivity.
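A tranche threshold is the score cut-off that keeps a chosen fraction of the truth-set variants found in your callset. The Python sketch below shows only that last step, assuming the recalibration scores have already been assigned; the scores and function names are invented for illustration and this is not GATK code:

```python
import math


def tranche_threshold(truth_scores, sensitivity: float) -> float:
    """Lowest score that still retains `sensitivity` of the truth variants."""
    if not truth_scores:
        raise ValueError("need at least one truth-set score")
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    ranked = sorted(truth_scores, reverse=True)
    # round() guards against floating-point noise before taking the ceiling.
    keep = max(1, math.ceil(round(sensitivity * len(ranked), 9)))
    return ranked[keep - 1]


def passes_tranche(scores, threshold: float):
    """Flag each variant whose score is at or above the threshold."""
    return [score >= threshold for score in scores]


# Hypothetical scores of callset variants that are also in the truth set.
truth = [9.0, 7.5, 6.0, 4.2, 3.3, 2.8, 1.0, 0.5, -1.2, -4.0]
cutoff = tranche_threshold(truth, sensitivity=0.9)
print(cutoff, passes_tranche([5.0, -2.0], cutoff))
```

Lowering the target sensitivity raises the cut-off, trading recovered true variants for fewer false positives.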
GATK Variant quality score recalibrator (VQSR) cont’d
• Caveats:
  - This procedure must be performed for SNPs and INDELs separately.
  - It does not work for organisms for which no “true/training” data sets are available.
  - The power of this method is dependent on the # of reads. Exome and/or low coverage experiments may produce many low-quality variant calls.
• See the GATK best practices for more information on applying this method
  - https://software.broadinstitute.org/gatk/documentation/article.php?id=2805
Annotate common SNPs in your data
• Bedtools intersect can be used to annotate variants from your callset that overlap with variants found in dbSNP.
  - intersectBed -wao -split -a A_reads.bt2.sorted_unique.raw.vcf -b SNP146.bed > A_reads.bt2.sorted_unique.annotated.vcf
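The overlap test itself is a coordinate comparison, with one trap: VCF positions are 1-based, while BED intervals are 0-based and half-open. A small Python sketch of the comparison for single-base variants; the BED line is a hypothetical entry built from the rsID shown earlier, and the linear scan is only for illustration (bedtools uses far more efficient interval lookups):

```python
def load_known_snps(bed_lines):
    """Index BED records (chrom, start, end, name) by chromosome."""
    known = {}
    for line in bed_lines:
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        known.setdefault(chrom, []).append((int(start), int(end), name))
    return known


def annotate(variants, known):
    """Attach names of known SNPs overlapping each (chrom, 1-based pos)."""
    annotated = []
    for chrom, pos in variants:
        zero_based = pos - 1  # VCF is 1-based, BED is 0-based half-open
        hits = [
            name for start, end, name in known.get(chrom, [])
            if start <= zero_based < end
        ]
        annotated.append((chrom, pos, hits))
    return annotated


known = load_known_snps(["chr16\t89824988\t89824989\trs140823801"])
for row in annotate([("chr16", 89824989), ("chr16", 89824990)], known):
    print(row)
```

Variants with an empty hit list are candidates for novel or rare variation; those with hits are already catalogued.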
Workflow summary
1) Prepare data
  - QC reads & align to reference
  - Picard Tools
2) Call variants (Samtools route / GATK route)
  - Assess quality: samtools mpileup / base quality score recalibration (BQSR)
  - Call variants: bcftools call / HaplotypeCaller
  - Filter variants: vcftools vcf-annotate / Variant Quality Score Recalibration (VQSR)
3) Annotate for function
  - Assess for rare/common variants
  - Variant annotations
Calling Variants: Annotation
• Annotate variants with (functional) consequence, eg. chr12:g.25232372A>G is a missense variant
• Popular tools include snpEff, and Variant Effect Predictor (VEP) from Ensembl
• Choice of gene annotation may affect variant annotation
  - RefSeq
  - Ensembl
  - GENCODE
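For a coding SNP, the consequence follows from translating the reference and altered codons with the standard genetic code. A minimal Python sketch; the function name is illustrative, and real annotators such as snpEff or VEP must also work out transcript, strand and reading frame, which this sketch takes as given:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_change(ref_codon: str, alt_codon: str) -> str:
    """Label a single-codon change as silent, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        # Not one of the three classes named earlier; reported separately.
        return "stop lost"
    return "missense"


# CAT (His) -> CAA (Gln) is one way to obtain a change like p.His78Gln.
for ref, alt in [("CAT", "CAA"), ("CTT", "CTC"), ("TGG", "TGA")]:
    print(ref, alt, classify_coding_change(ref, alt))
```

Because the codon depends on which transcript is used, the same genomic change can receive different labels under RefSeq, Ensembl or GENCODE models.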
Annotation of non-coding variation
• Haploreg http://archive.broadinstitute.org/mammals/haploreg/haploreg.php
• SNPs can be visualized with
  - Chromatin state and protein binding annotation from the Roadmap Epigenomics and ENCODE projects.
  - Sequence conservation across mammals, the effect of SNPs on regulatory motifs, and the effect of SNPs on expression from eQTL studies.
Hands-on: Haploreg
• Identifying the potential function of non-coding variants.
Hands-on: Samtools: Examine Called variants
• Analyze called variants in IGV.
BaRC SOP
• Variant calling using Samtools and GATK. Manipulating/interpreting VCF files
http://barcwiki/wiki/SOPs under Variant calling and analysis
Resources For Mining Variants
Database                            Link
dbSNP                               www.ncbi.nlm.nih.gov/SNP
HapMap                              hapmap.ncbi.nlm.nih.gov
1000 Genomes                        1000genomes.org
UK10K                               uk10k.org
Exome Variant Server (EVS)          evs.gs.washington.edu/EVS
Personal Genome Project (Harvard)   personalgenomes.org
ExAC Browser (Broad)                exac.broadinstitute.org
Resources For Mining Variants: Cancer
Database                                           Link
International Cancer Genome Consortium (ICGC)      icgc.org
Catalogue of Somatic Mutations in Cancer (COSMIC)  cancer.sanger.ac.uk
cBioPortal for Cancer Genomics                     cbioportal.org
Cancer Cell Line Encyclopedia (CCLE)               broadinstitute.org/ccle
Resources For Mining Variants: Plants
• 1001 Genomes (A. thaliana 1001 strains)
  - 1001genomes.org
• 1000 Plants (1KP; large-scale gene sequencing of at least 1000 plant species)
  - www.onekp.com
Variant Calling workflow
• Please see our Variant Calling walkthrough exercise here:
  - http://jura.wi.mit.edu/bio/education/hot_topics/GenomeVariants_Jul2017/Genome_Variant_calling_walkthrough.txt
• Within you will find the commands required for calling variants with both samtools and GATK.