

UK Biobank Phasing and Imputation Documentation

Version 1.2

13 November 2015

Documentation author: Jonathan Marchini, Department of Statistics, University of Oxford, on behalf of UK Biobank

Contributors to UK Biobank Phasing and Imputation: Jonathan Marchini (Statistics Dept, Oxford), Jared O'Connell (WTCHG, Oxford), Olivier Delaneau (University of Geneva), Kevin Sharp (Statistics Dept, Oxford), Warren Kretzschmar (WTCHG, Oxford), Gavin Band (WTCHG, Oxford), Shane McCarthy (WTSI, Hinxton), Desislava Petkova (WTCHG, Oxford), Claire Bycroft (WTCHG, Oxford), Colin Freeman (WTCHG, Oxford), Peter Donnelly (WTCHG, Oxford).


Table of Contents

Introduction

Phasing
    Filtering before phasing
    Phasing method description
    Validation of the phasing method
    Whole genome phasing

Genotype imputation
    Assessment of the UK Biobank Array for imputation
    Reference panel used for imputation
    Imputation method description
    Whole genome imputation
    Information scores, minor allele frequencies and filtering
    Imputed genotype files
        Sample files
    Differences between raw genotypes and imputed files

An exemplar genome-wide association study
    Sample filtering
    Taking account of the different arrays used
    Association testing
    Results

File processing

References


Introduction

This document describes the analysis carried out to perform genotype imputation for the interim release of the UK Biobank (UKB) genotype data. It also provides advice about using the imputed data to carry out genome-wide association studies (GWAS) or for extracting genotypes for use as covariates in other types of association study.

Genotype imputation [1,2] is the process of predicting genotypes that are not directly assayed in a sample of individuals. A reference panel of haplotypes at a dense set of SNPs, indels and structural variants is used to impute genotypes into a study sample of individuals that have been genotyped at a subset of the SNPs. These 'in silico' genotypes can then be used to boost the number of SNPs that can be tested for association. This increases the power of the study and the ability to resolve or fine-map the causal variants, and facilitates meta-analysis. The result of the imputation process is a dataset with 73,355,667 SNPs, short indels and large structural variants in 152,249 individuals. See Box 1 of [1] for a quick visual overview of how genotype imputation works.

The process of imputation is divided into two steps: (i) pre-phasing and (ii) imputation. In the first step, the samples to be imputed are 'pre-phased', i.e. a statistical method is applied to the genotype data to infer the underlying haplotypes of each individual. In the second step, a different statistical method is used to combine the inferred haplotypes with a reference panel of haplotypes and impute the unobserved genotypes in each sample. The following two sections of this document describe how the pre-phasing and imputation were carried out on the ~150,000 samples.

Phasing and imputation can be a computationally intensive process. To avoid many different research groups having to carry this out independently, phasing and imputation have been carried out centrally. Questions about using the imputed genotypes should be sent to the UKB Genetics mailing list set up for this purpose. You can subscribe to the mailing list here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS


Phasing

Filtering before phasing

To create the input dataset for phasing, we applied SNP QC filters as described in the UK Biobank QC documentation [3]. The samples were genotyped on two slightly different chips. Approximately 50,000 were genotyped as part of the UK BiLEVE study using a chip designed for that study (denoted UKBL), with the remaining samples (~100,000) genotyped on the UKB chip. Therefore, we applied different missingness filters to SNPs depending on the chip. SNPs were removed based on the number of batches in which they were completely missing:

i. SNPs on both the UKB chip and the UKBL chip - remove them if they are missing in more than 3 batches (out of 33 batches)

ii. SNPs on the UKB chip and not the UKBL chip - remove them if they are missing in more than 2 batches (out of 22 batches)

iii. SNPs on the UKBL chip and not the UKB chip - remove them if they are missing in more than 1 batch (out of 11 batches)
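The three batch-missingness rules above can be written as a single lookup. The following is a minimal sketch, not the code used for the release; the function names and the boolean chip-membership arguments are hypothetical, introduced only for illustration.

```python
def max_missing_batches(on_ukb: bool, on_ukbl: bool) -> int:
    """Largest number of completely missing batches a SNP may have and still be kept."""
    if on_ukb and on_ukbl:
        return 3  # rule i: out of 33 batches
    if on_ukb:
        return 2  # rule ii: out of 22 batches
    if on_ukbl:
        return 1  # rule iii: out of 11 batches
    raise ValueError("SNP is on neither chip")


def keep_snp(on_ukb: bool, on_ukbl: bool, n_missing_batches: int) -> bool:
    """Return True if the SNP passes the batch-missingness filter."""
    return n_missing_batches <= max_missing_batches(on_ukb, on_ukbl)


# A SNP on both chips that is completely missing in 3 batches is kept,
# whereas one missing in 4 batches is removed.
print(keep_snp(True, True, 3), keep_snp(True, True, 4))
```

Each threshold corresponds to "more than k batches" in the text, so a SNP with exactly k missing batches is retained.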

A total of 1,037 sample outliers [3] were removed. Multi-allelic SNPs and SNPs with a minor allele frequency (MAF) < 1% were then removed from the dataset. These filters resulted in a dataset with 641,018 autosomal SNPs in 152,256 samples. Chromosome X phasing and imputation will be carried out at a later date.

Phasing method description

Phasing on the autosomes was carried out using a version of the SHAPEIT2 [4] program modified to allow for very large sample sizes. This new method (which we refer to as SHAPEIT3) modifies SHAPEIT2's surrogate family approach to remove a quadratic complexity component of the algorithm [5]. In small samples of a few thousand individuals, this part of the algorithm, which involves calculating Hamming distances between current haplotype estimates, contributes only a relatively small part of the computational cost. As sample sizes increase beyond 10,000 samples, this component becomes significant. The new algorithm uses a divisive clustering algorithm to identify clusters of haplotypes, and then calculates Hamming distances only between pairs of haplotypes within each cluster. Only haplotypes within each cluster are used as candidates for the surrogate family copying states in the HMM. The resulting algorithm has complexity O(N log N), where N is the number of haplotypes in the dataset being phased. In practice, we have observed that the method exhibits scaling close to linear. This is a crucial feature of the method, especially for very large sample sizes, and a property not shared by other approaches [6,7]. The development of this approach is ongoing and there is substantial scope to make further improvements in speed and accuracy. A newer version is likely to offer an order of magnitude reduction in run time.
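To illustrate why restricting Hamming distance calculations to clusters removes the quadratic cost, here is a toy sketch. It is not the SHAPEIT3 implementation: the seed-based splitting rule, the function names and the `max_size` parameter are all assumptions made for illustration, and the actual clustering used by SHAPEIT3 is described in [5].

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def divisive_clusters(haps, ids, max_size):
    """Recursively split haplotype indices until each cluster has <= max_size members."""
    if len(ids) <= max_size:
        return [ids]
    seed_a = ids[0]
    # Second seed: the haplotype farthest from the first seed.
    seed_b = max(ids, key=lambda i: hamming(haps[seed_a], haps[i]))
    left, right = [], []
    for i in ids:
        if hamming(haps[i], haps[seed_a]) <= hamming(haps[i], haps[seed_b]):
            left.append(i)
        else:
            right.append(i)
    if not right:  # all haplotypes identical: fall back to halving
        mid = len(ids) // 2
        left, right = ids[:mid], ids[mid:]
    return (divisive_clusters(haps, left, max_size)
            + divisive_clusters(haps, right, max_size))


def within_cluster_distances(haps, max_size):
    """Hamming distances for pairs in the same cluster only."""
    clusters = divisive_clusters(haps, list(range(len(haps))), max_size)
    dists = {}
    for cluster in clusters:
        for i, j in combinations(cluster, 2):
            dists[(i, j)] = hamming(haps[i], haps[j])
    return clusters, dists


haps = ["0000", "0001", "1111", "1110"]
clusters, dists = within_cluster_distances(haps, max_size=2)
print(clusters)    # two clusters of two haplotypes each
print(len(dists))  # 2 within-cluster pairs, versus 6 for all pairs
```

With a fixed maximum cluster size, the number of pairwise distances grows linearly in the number of haplotypes rather than quadratically, which is the property the text relies on.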


Validation of the phasing method

The accuracy of this new method was assessed by taking advantage of 72 mother-father-child trios that were identified in the UKB dataset [3]. This family information can be used to infer the phase of a large number of SNPs in the trio parents. These family-inferred haplotypes were used as a truth set, as is common in the phasing literature [4]. The parents of each trio were removed from the dataset and then haplotypes were estimated across chromosome 20 in a single run of SHAPEIT3. This dataset consisted of 16,762 autosomal SNPs. The inferred haplotypes were then compared to the truth set using the switch error metric [4]. We obtained an exceptionally low switch error rate of 0.4% across the trio children reporting British ancestry. By adjusting parameters of the method we have observed switch error rates lower than 0.3%. With switch error rates this low, long chunks of sequence of many megabases will be inferred correctly. Downstream imputation from such haplotypes will be highly accurate.

To assess the performance gain of phasing all 152,112 samples together versus phasing in smaller subsets of samples, two other test datasets of 1,072 and 10,072 samples were created, each also containing the trio children. The results are shown in full detail in Table 1 and highlight the benefits of joint phasing of all the samples. These results clearly demonstrate the close to linear scaling of the SHAPEIT3 algorithm.

Sample size   Method     Switch error (%)   Run time (hrs)   Run time scaling   Sample size scaling   Threads
1,072         SHAPEIT3   2.6                0.25             1                  1                     10
10,072        SHAPEIT3   1.3                2.5              10                 9.4                   10
152,112       SHAPEIT3   0.4                38.5             154                142                   10

Table 1: Phasing performance on UKB samples.
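The switch error metric reported in Table 1 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code used for the table: it assumes haplotypes are given as equal-length sequences of 0/1 alleles, that the estimated and true genotypes agree at every site, and that the rate is the number of phase switches divided by the number of consecutive heterozygous-site pairs. The function and argument names are hypothetical.

```python
def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Fraction of consecutive heterozygous-site pairs whose relative phase is wrong."""
    # Only heterozygous sites carry phase information.
    het = [k for k in range(len(true_h1)) if true_h1[k] != true_h2[k]]
    if len(het) < 2:
        return 0.0
    # True where the first estimated haplotype carries the first true haplotype's allele.
    orientation = [est_h1[k] == true_h1[k] for k in het]
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(het) - 1)


true_h1 = [0, 0, 1, 0, 1]
true_h2 = [1, 1, 0, 0, 0]
est_h1 = [0, 0, 0, 0, 0]
est_h2 = [1, 1, 1, 0, 1]
# Four heterozygous sites, one switch between the second and third of them.
print(switch_error_rate(true_h1, true_h2, est_h1, est_h2))
```

Swapping the labels of the two estimated haplotypes leaves the result unchanged, since only changes in orientation between neighbouring heterozygous sites are counted.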

Whole genome phasing

Phasing was carried out in chunks of 5,000 SNPs, with an overlap of 250 SNPs between chunks. SHAPEIT3 was run on each chunk using 4 cores per job and S=200 copying states. Any remaining missing genotypes were imputed as part of the phasing process. Chunks were ligated using a modified version of the hapfuse program.
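The chunking scheme can be sketched as below. This is a minimal illustration, assuming half-open index ranges and a shorter final chunk; the document does not state how chunk boundaries or the last chunk were actually handled, and the function name is hypothetical.

```python
def snp_chunks(n_snps, chunk_size=5000, overlap=250):
    """Return half-open (start, end) SNP index ranges with a fixed overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if n_snps <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, n_snps)
        chunks.append((start, end))
        if end == n_snps:
            break
        start += step
    return chunks


# Three chunks cover 12,000 SNPs; neighbouring chunks share 250 SNPs.
print(snp_chunks(12000))
```

The shared SNPs between neighbouring chunks are what a ligation step can use to align the haplotype labels of one chunk with those of the next.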


Genotype imputation

Assessment of the UK Biobank Array for imputation

The UK Biobank Axiom array from Affymetrix was specifically designed to optimize imputation performance in GWAS [8]. An experiment was carried out to assess the imputation performance of the array, stratified by allele frequency, and to compare its performance to that of some other commercially available arrays.

Performance was assessed using high-coverage, whole-genome sequence data made publicly available by Complete Genomics (CG).

Data from 10 samples from the European ancestry (CEU) population were used. All variant sites with a call rate below 90% were filtered out in order to consider only very reliable sites in the analysis. Only data from chromosome 20 were used. To mimic a typical imputation analysis, a pseudo-GWAS dataset was constructed by extracting the CG SNP genotypes at all the sites included on a given array. All sites not on the array were then imputed using the UK10K reference panel [9]. Imputation was carried out using IMPUTE2 [10], which chooses a custom reference panel for each study individual in each 1 Mb segment of the genome. The k_hap parameter of IMPUTE2 was set to 1,000. All other parameters were set to default values. This experiment was repeated for 4 different genome-wide SNP arrays: (a) Affymetrix UK Biobank Axiom array, (b) Illumina Omni 2.5M array, (c) Illumina Omni 1M Quad, (d) Illumina Omni Express.

Variants were stratified into allele frequency bins and the squared correlation (R^2) was calculated between the allele dosages at variants in each bin and the masked CG genotypes. Since different arrays contain different numbers of variants, it is important to make sure that imputation performance is measured at the same set of variants when comparing chips. To achieve this, both imputed and array variants were included in the R^2 analysis, so that the comparison measures the overall performance of each array. As a consequence, an array with more variants will gain an advantage, as it is reasonable to expect that directly genotyping a variant will yield more accurate genotypes than imputation.

Figure 1 shows the results of this analysis. The x-axis is non-reference allele frequency (%) on a log scale, which focuses in on rarer variants. The y-axis is imputation performance (R^2). The salient points are:

a. The UK Biobank chip (purple) outperforms the Illumina Omni 1M Quad (blue) and Illumina Omni Express (green), both of which have comparable numbers of variants.

b. The UK Biobank chip performs almost as well as the Illumina Omni 2.5M chip (red), which has ~3 times the number of SNPs. It is worth noting that the UKB chip and Illumina Omni 2.5M chip are very close in the 1-5% range. This is likely a consequence of the chip design process focusing in part on this frequency range [8].


The overall conclusion of this analysis is that the Affymetrix UKB array is a very good array from which to carry out genotype imputation. The caveat is that this analysis is focused on samples with European ancestry.

Figure 1: Comparison of imputation performance of the UK Biobank Array and several other commercially available genotyping arrays.
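The accuracy measure plotted in Figure 1 can be sketched as follows. This is an illustrative calculation only: it assumes that "aggregate" R^2 means pooling dosages and true genotypes across all variants in a frequency bin before computing the squared Pearson correlation, which the document does not spell out. The function names and the input layout are hypothetical.

```python
def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined without variation
    return sxy * sxy / (sxx * syy)


def aggregate_r2_by_bin(variants, bin_edges):
    """variants: iterable of (frequency_percent, imputed_dosages, true_genotypes)."""
    pooled = {}
    for freq, dosages, truth in variants:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= freq < hi:
                xs, ys = pooled.setdefault((lo, hi), ([], []))
                xs.extend(dosages)
                ys.extend(truth)
                break
    return {b: squared_correlation(xs, ys) for b, (xs, ys) in pooled.items()}


variants = [
    (0.5, [0.0, 1.0, 2.0], [0, 1, 2]),       # perfectly imputed variant
    (2.0, [0.0, 0.0, 1.0, 1.0], [0, 1, 0, 1]),  # dosages unrelated to the truth
]
result = aggregate_r2_by_bin(variants, [0.1, 1, 5])
print(result)
```

Pooling within a bin, rather than averaging per-variant values, avoids undefined correlations at rare variants where a small sample may contain no copies of the minor allele.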

Reference panel used for imputation

There are a number of factors that influence the accuracy of genotype imputation [1], but generally accuracy will increase as the number of haplotypes in the reference panel grows and if the ancestry of the sample haplotypes is a good match to the ancestry of the reference panel haplotypes. The UKB dataset consists of samples with a diverse range of ancestries, but with the majority of samples having British (or European) ancestry. For this reason it was desirable to use a reference panel with a large number of haplotypes with British and European ancestry, and also a diverse set of haplotypes from other world-wide populations. To achieve this, the UK10K haplotype reference panel was merged with the 1000 Genomes Phase 3 reference panel using the -merge_ref_panels option in the IMPUTE2 software. Merging these panels has been shown to produce a high-quality reference panel for imputation [9]. An advantage of this reference panel is that it includes SNPs, short indels and larger structural variants. The reference panel consists of 87,696,888 bi-allelic variants in 12,570 haplotypes.

[Figure 1 plot. Title: Genotyping accuracy after imputation from UK10K (7,562 haplotypes). Samples: 10 EUR CG. Comparison at 219,303 sites on chr20 (includes genotyped SNPs). Allele frequency calculated from reference panel. x-axis: non-reference allele frequency (%), log scale from 0.02 to 100. y-axis: aggregate R^2, from 0.0 to 1.0. Legend (genotyping array): Illumina Omni 2.5M, Illumina Omni 1M Quad, Illumina Omni Express, Affy UK Biobank.]


Imputation method description

Imputation was carried out using the same algorithm as is implemented in the IMPUTE2 program. The current IMPUTE2 program is a very flexible tool for phasing and imputation that implements a general set of options. A new C++ program was written from scratch to focus exclusively on the haploid imputation needed when samples have been pre-phased. This new program is more efficient than IMPUTE2 in both memory use and computation. The method takes advantage of high correlations between inferred copying states in the HMM to reduce computation. We refer to this program as IMPUTE3.

Whole genome imputation

Imputation was carried out in chunks of 2 Mb with a 250 kb buffer region. A set of 2,000 haplotype copying states was used to impute each sample. Imputed variants in each non-overlapping part of each chunk were concatenated into per-chromosome files.
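The relationship between the non-overlapping part of each chunk and its buffered extent can be sketched as below. This is an illustration under assumptions not stated in the document: 1-based inclusive coordinates, a buffer applied on both sides of each chunk, and truncation at the chromosome ends. The function name and dictionary keys are hypothetical.

```python
def imputation_windows(chrom_length, core=2_000_000, buffer=250_000):
    """Return core (non-overlapping) and buffered intervals, 1-based inclusive."""
    windows = []
    for core_start in range(1, chrom_length + 1, core):
        core_end = min(core_start + core - 1, chrom_length)
        windows.append({
            # Variants in this interval are kept in the final output.
            "core": (core_start, core_end),
            # Data in this wider interval are used when imputing the chunk.
            "with_buffer": (max(1, core_start - buffer),
                            min(chrom_length, core_end + buffer)),
        })
    return windows


for window in imputation_windows(5_000_000):
    print(window)
```

Because the core intervals tile the chromosome without gaps or overlap, concatenating the core parts in order yields each variant exactly once.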

Information scores, minor allele frequencies and filtering

QCTOOL was used to calculate the minor allele frequency (MAF) and imputation information score of each imputed variant. The imputation information score is a metric between 0 and 1. A value of 1 indicates that there is no uncertainty in the imputed genotypes, whereas a value of 0 means that there is complete uncertainty about the genotypes. A value of α in a sample of N individuals indicates that the amount of data at the imputed SNP is approximately equivalent to a set of perfectly observed genotype data in a sample of size αN.
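For intuition, the sketch below computes an information score from genotype probability triples using one widely used definition, the IMPUTE-style measure that compares the variance of the imputed dosages with the variance expected under Hardy-Weinberg proportions. Whether this matches QCTOOL's calculation in every detail is an assumption, and the function names are hypothetical.

```python
def info_score(probs):
    """IMPUTE-style information measure from (P(AA), P(AB), P(BB)) triples."""
    n = len(probs)
    expected = [p_ab + 2 * p_bb for _, p_ab, p_bb in probs]       # E[genotype]
    second_moment = [p_ab + 4 * p_bb for _, p_ab, p_bb in probs]  # E[genotype^2]
    theta = sum(expected) / (2 * n)  # estimated frequency of allele B
    if theta <= 0 or theta >= 1:
        return 1.0  # monomorphic: the measure is set to 1 by convention
    uncertainty = sum(f - e * e for f, e in zip(second_moment, expected))
    return 1.0 - uncertainty / (2 * n * theta * (1 - theta))


def effective_sample_size(info, n_samples):
    """Approximate number of perfectly observed samples, i.e. alpha * N."""
    return info * n_samples


certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
uninformative = [(0.25, 0.5, 0.25)] * 4
print(info_score(certain), info_score(uninformative))
print(effective_sample_size(0.3, 150_000))
```

Genotypes imputed with certainty give a score of 1, while triples equal to the Hardy-Weinberg prior for every individual give a score of 0, matching the two extremes described above.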

Many GWAS carried out to date have filtered variants by applying thresholds on MAF and information score. There is no single correct threshold to use. However, as MAF decreases it is generally the case that imputation quality decreases. Previous studies have tended to use an information threshold between 0.3 and 0.5. Since these studies have typically consisted of hundreds or low thousands of samples, an information score of 0.3 corresponds to an effective sample size with limited power to detect associations. However, the UK Biobank dataset is considerably larger than most previous GWAS. An information score of 0.3 in ~150,000 samples roughly corresponds to an effective sample size of ~45,000, which would be expected to yield very good power to detect association.

Some variants are imputed as monomorphic, or close to monomorphic, i.e. with no or almost no variation in the genotypes. Such sites were removed with QCTOOL by applying a MAF filter of 0.001%. In addition, 7 samples were removed from the dataset because these individuals had requested that their data be removed from the study. The resulting dataset consists of 73,355,667 variants in 152,249 individuals.

The distribution of information scores at these 73,355,667 variants is shown in Figure 2(a). Plots stratified by MAF are also shown: (b) MAF >= 5%, (c) 1% <= MAF < 5%, (d) 0.1% <= MAF < 1%, (e) 0.01% <= MAF < 0.1%, (f) 0.001% <= MAF < 0.01%.


Figure 2: Distribution of information scores at variants in the imputed dataset. The x-axis shows the information score on the scale 0 to 1.

Imputed genotype files

Let $G_{ij}$ denote the genotype of the ith sample at the jth variant. The process of genotype imputation produces a probability distribution for each genotype, i.e.

$$p_{ij0} = P(G_{ij} = \mathrm{AA}), \quad p_{ij1} = P(G_{ij} = \mathrm{AB}), \quad p_{ij2} = P(G_{ij} = \mathrm{BB})$$

where A and B are the two alleles at the variant. This probability triple (which sums to 1) is provided in the imputed genotype files for each imputed variant in all samples. SNP variants included in the phased dataset also occur in the imputed files in this format.

The imputed data is provided in a compressed binary BGEN file format. The BGEN file format is a binary version of the GEN file format.

The BGEN file format was chosen to provide good compression of the imputed data and ease of use for genetic association testing against traits and phenotypes. For example, commonly used programs such as SNPTEST and PLINK already read BGEN files, and QCTOOL can be used to filter, summarize, manipulate and convert the files to other formats.

The format stores one variant at a time (i.e. per row). As MAF decreases, more compression is possible due to increased similarity between imputed genotypes across samples. The total size of the UKB interim release dataset is 1.3 TB, with each chromosome file ranging in size from 20 GB to 109 GB. As the file format is binary, the files are not viewable in normal text editors. Later in this document there is advice and guidance on working with these files.

[Figure 2 panels (histograms; x-axis: Information, 0.0 to 1.0; y-axis: Frequency): (a) All variants; (b) MAF >= 5%: #SNPs = 7,011,470; (c) 1% <= MAF < 5%: #SNPs = 2,889,302; (d) 0.1% <= MAF < 1%: #SNPs = 10,051,623; (e) 0.01% <= MAF < 0.1%: #SNPs = 26,262,886; (f) 0.001% <= MAF < 0.01%: #SNPs = 26,140,277.]

The files are named as

chrNimpv1.bgen

where N is the number of the autosome (N = 1, ..., 22).

RSIDs were added to the BGEN files for as many variants as possible using RSID lists available from the UK10K website and the 1000 Genomes website.

RSIDs are useful, unique identifiers of SNPs and other variants and can be looked up in the dbSNP database. When researchers report associations of variants with diseases and traits, they normally report the results using the RSID.

Variant positions are reported in Genome Reference Consortium Human genome build 37 co-ordinates (GRCh37).

Sample files

In addition to the 22 autosomal BGEN files, there is a file called impv1.sample

This file (referred to as the 'sample file') is the part of the BGEN file format that stores information about each sample in the dataset. The format of this file is described on the GEN file format webpage.

The sample file has two header lines, followed by one line for each individual in the BGEN file. The order of the individuals in the sample file matches the order of the individuals in the BGEN file. The order is important: programs that read bgen/sample pairs assume that the order matches between the files.

The sample file can be used to store information about each individual, i.e. phenotypes and covariates. If phenotypes and covariates are added to the sample file, then SNPTEST can be used to carry out association testing at each variant. Care should be taken to make sure that such information is correctly added to sample files. The format allows discrete and continuous phenotypes and covariates, as well as missing values (see the GEN file format webpage referred to above).
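A sample file of this kind can be read with a few lines of code. The sketch below is illustrative only and assumes the conventions of the GEN/sample format: whitespace-separated fields, a first header line of column names, a second header line of column types (0 for the ID and missingness columns, D and C for discrete and continuous covariates, P and B for continuous and binary phenotypes), and NA for missing values. The `height` and `array` columns and the function name are hypothetical examples, not columns of impv1.sample.

```python
import io


def read_sample_file(handle):
    """Parse a sample file into column names, column types and per-sample rows."""
    names = handle.readline().split()
    types = handle.readline().split()
    if len(names) != len(types):
        raise ValueError("the two header lines have different numbers of columns")
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError("row has the wrong number of columns")
        row = {}
        for name, kind, value in zip(names, types, fields):
            if value == "NA":
                row[name] = None          # missing value
            elif kind in ("C", "P"):
                row[name] = float(value)  # continuous covariate or phenotype
            else:
                row[name] = value         # IDs, discrete and binary columns
        rows.append(row)
    return names, types, rows


example = io.StringIO(
    "ID_1 ID_2 missing height array\n"
    "0 0 0 P D\n"
    "s1 s1 0 172.5 1\n"
    "s2 s2 0 NA 2\n"
)
names, types, rows = read_sample_file(example)
print(rows)
```

Rows are returned in file order, which matters because that order must match the order of individuals in the BGEN file.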

Differences between raw genotypes and imputed files

SNPs below 1% MAF were filtered out before the phasing step; however, many of these SNPs will have been imputed. These SNPs will therefore appear in both the raw genotype files and the imputed files, but may have different genotypes. As such, researchers should not be surprised if the results of analysis at these SNPs differ depending on which files are used.


An exemplar genome-wide association study

A GWAS for the phenotype of height was carried out to assess the use of the UK Biobank genetic data as a resource for genetic association studies. There are already a substantial number of replicated associations for this phenotype [11]. The purpose of this analysis was not to report new associations, but rather to check that a reasonably standard GWAS pipeline produced valid results.

Sample filtering

Principal component analysis and self-declared ethnicity were used to derive a "White British" subset of samples. In addition, samples were excluded if they

(a) had at least one related sample,
(b) had a genetically inferred gender that did not match the self-reported gender, or
(c) were among ~500 extreme outliers [3].

These filters resulted in a dataset with 112,338 samples.

Taking account of the different arrays used

Some SNPs are only included on one of the UKB or UKBL arrays. At such SNPs, missing genotypes will have been imputed as part of the phasing process, so the data at these SNPs will consist of a mixture of genotyped and imputed genotypes. This can lead to bias in association testing if there is some correlation between the phenotype and which array a sample was assayed on. The samples involved in the UKBL study were selected based on phenotypes associated with lung function [12], thus it may be possible for such associations to occur. There are at least 2 solutions to ameliorate any possible confounding due to array:

a. carry out association tests conditioning on a binary indicator of array.

b. carry out separate tests of association in UKB samples and UKBL samples and combine the results using meta-analysis.
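Option (b) can be sketched with a fixed-effect, inverse-variance weighted meta-analysis. The document does not say which meta-analysis method should be used, so the choice of method here is an assumption, as are the function name and the example effect sizes, which are made up for illustration.

```python
import math


def inverse_variance_meta(estimates):
    """Combine (beta, standard_error) pairs with inverse-variance weights."""
    weights = [1.0 / (se * se) for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se, beta / se  # combined effect, its standard error, z-score


# Hypothetical per-array results for one variant: (beta, standard error).
ukb_result = (0.10, 0.02)
ukbl_result = (0.14, 0.02)
beta, se, z = inverse_variance_meta([ukb_result, ukbl_result])
print(beta, se, z)
```

Because each array is analysed separately, any difference in mean phenotype between the two sets of samples cannot induce a spurious association at SNPs that were genotyped on only one array.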

Association testing

GWAS was performed at all variants using SNPTEST. An additive genetic model was fitted at each SNP, using gender, age, array and 10 principal components as covariates. That is, the example uses option (a) above. The program option -method expected was used in the SNPTEST software, which converts the genotype probability triple to an expected genotype, $d_{ij}$ (often called the dosage), calculated as

$$d_{ij} = \sum_{k=0}^{2} k \, p_{ijk}$$
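As a worked illustration of the dosage calculation, the sketch below converts one probability triple into an expected genotype. The function name and the tolerance used to check that the triple sums to 1 are assumptions made for this example.

```python
def expected_dosage(p_aa, p_ab, p_bb, tol=1e-6):
    """Expected count of the B allele: 0*P(AA) + 1*P(AB) + 2*P(BB)."""
    if abs(p_aa + p_ab + p_bb - 1.0) > tol:
        raise ValueError("genotype probabilities must sum to 1")
    return 0 * p_aa + 1 * p_ab + 2 * p_bb


# A sample with P(AA)=0.1, P(AB)=0.3, P(BB)=0.6 has dosage 0.3 + 2*0.6.
print(expected_dosage(0.1, 0.3, 0.6))
```

The dosage lies between 0 and 2 and equals the hard genotype call whenever one of the three probabilities is 1.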


Results

The GWAS for height produced a substantial number of associated regions. These regions had a high correspondence to those genetic regions that have previously been replicated for height and described in the NHGRI GWAS Catalog [11]. The analysis suggested that a significant number of novel loci could be identified. Figure 3 shows a plot of the -log10 p-values for the height and BMI scans on chromosome 4.

Figure 3: Chromosome 4 GWAS for height. The x-axis shows physical position. The y-axis is the -log10 p-value for each tested variant. Variants on the array are shown as black dots, imputed variants are shown as grey dots. Reported associations from the NHGRI GWAS Catalog are shown as red crosses. The blue and red horizontal lines are drawn at -log10 p-values of 5 and 7.5 respectively.

File processing

We recommend that researchers use the QCTOOL program to handle the BGEN files. This program has options for extraction or removal of subsets of the data (SNPs and/or samples), and for file format conversion. See the QCTOOL examples page for information on command lines used to perform specific tasks. The program SNPTEST can process BGEN files. It will automatically detect the BGEN file format if data files are named with the .bgen extension. PLINK v1.9 can process BGEN files; at the time of writing BGEN files are specified using the --bgen option. For further information on tools supporting the BGEN format, see the BGEN file format website.


References

1. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).

2. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).

3. The UK Biobank. UK Biobank Genotyping QC documentation. (2015).

4. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).

5. O'Connell, J., Sharp, K., Delaneau, O. & Marchini, J. Haplotype estimation for biobank scale datasets. (2015) (submitted).

6. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).

7. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).

8. The UK Biobank Array Design Group. UK Biobank Axiom Array Content Summary. (2014).

9. Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nature Communications 6, 8111 (2015).

10. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 (Bethesda) 1, 457–470 (2011).

11. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucl. Acids Res. 42, D1001–6 (2014).

12. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med 3, 769–781 (2015).
