GENETICS | INVESTIGATION

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale,*,1,2 Stephanie Hesselson,*,1 Thomas J. Hoffmann,*,† Yang Cao,* David Chan,‡ Sheryl Connell,§ Lisa A. Croen,§ Brad P. Dispensa,* Jasmin Eshragh,* Andrea Finn,‡ Jeremy Gollub,‡ Carlos Iribarren,§ Eric Jorgenson,§ Lawrence H. Kushi,§ Richard Lao,* Yontao Lu,‡ Dana Ludwig,§ Gurpreet K. Mathauda,* William B. McGuire,§ Gangwu Mei,‡ Sunita Miles,§ Michael Mittman,‡ Mohini Patil,‡ Charles P. Quesenberry Jr.,§ Dilrini Ranatunga,§ Sarah Rowell,§ Marianne Sadler,§ Lori C. Sakoda,§ Michael Shapero,‡ Ling Shen,§ Tanu Shenoy,* David Smethurst,§ Carol P. Somkin,§ Stephen K. Van Den Eeden,§ Lawrence Walter,§ Eunice Wan,* Teresa Webster,‡ Rachel A. Whitmer,§ Simon Wong,* Chia Zau,§ Yiping Zhan,‡ Catherine Schaefer,§,1,2 Pui-Yan Kwok,**,1,2 and Neil Risch*,†,§,1,2

*Institute for Human Genetics, University of California, San Francisco, California 94143, †Department of Epidemiology and Biostatistics, and **Cardiovascular Research Institute, University of California, San Francisco, California 94158, ‡Affymetrix Inc., Santa Clara, California 95051, and §Kaiser Permanente Northern California Division of Research, Oakland, California 94612

ABSTRACT The Kaiser Permanente (KP) Research Program on Genes, Environment and Health (RPGEH), in collaboration with the University of California, San Francisco, undertook genome-wide genotyping of >100,000 subjects that constitute the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. The project, which generated >70 billion genotypes, represents the first large-scale use of the Affymetrix Axiom Genotyping Solution. Because genotyping took place over a short 14-month period, creating a near-real-time analysis pipeline for experimental assay quality control and final optimized analyses was critical. Because of the multi-ethnic nature of the cohort, four different ethnic-specific arrays were employed to enhance genome-wide coverage. All assays were performed on DNA extracted from saliva samples. To improve sample call rates and significantly increase genotype concordance, we partitioned the cohort into disjoint packages of plates with similar assay contexts. Using strict QC criteria, the overall genotyping success rate was 103,067 of 109,837 samples assayed (93.8%), with a range of 92.1–95.4% for the four different arrays. Similarly, the SNP genotyping success rate ranged from 98.1 to 99.4% across the four arrays, the variation depending mostly on how many SNPs were included as single copy vs. double copy on a particular array. The high quality and large scale of genotype data created on this cohort, in conjunction with comprehensive longitudinal data from the KP electronic health records of participants, will enable a broad range of highly powered genome-wide association studies on a diversity of traits and conditions.

KEYWORDS genome-wide genotyping; GERA cohort; Affymetrix Axiom; saliva DNA; quality control

Copyright © 2015 by the Genetics Society of America
doi: 10.1534/genetics.115.178905
Manuscript received January 16, 2015; accepted for publication June 2, 2015; published Early Online June 19, 2015.
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1.
Sequence data from this article have been deposited with the dbGaP under accession no. phs00078.
1 These authors contributed equally to this work.
2 Corresponding authors: University of California, San Francisco (UCSF), Institute for Human Genetics, 513 Parnassus Ave., San Francisco, CA 94143-0794. E-mail: kvalem@humgen.ucsf.edu; Kaiser Permanente Northern California Division of Research, Oakland, CA 94612. E-mail: Cathy.Schaefer@kp.org; University of California, San Francisco (UCSF), Institute for Human Genetics, 513 Parnassus Ave., San Francisco, CA 94143-0794. E-mail: pui.kwok@ucsf.edu; and University of California, San Francisco (UCSF), Institute for Human Genetics, 513 Parnassus Ave., San Francisco, CA 94143-0794. E-mail: rischn@humgen.ucsf.edu
Genetics, Vol. 200, 1051–1060, August 2015

The Genetic Epidemiology Research on Adult Health and Aging (GERA) resource is a cohort of >100,000 subjects who are participants in the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC), Research Program on Genes, Environment and Health (RPGEH) (a detailed description of the cohort and study design can be found in dbGaP, Study Accession: phs000674.v1.p1). Genome-wide genotyping was targeted for this cohort to enable large-scale genome-wide association studies by linkage to comprehensive longitudinal clinical data derived from extensive KPNC electronic health record databases. The cohort is multi-ethnic, with 20% minority representation (African American, East Asian, and Latino or mixed) and the remaining 80% non-Hispanic white. For this project, four ethnic-specific arrays were designed based on the Affymetrix Axiom Genotyping System (Hoffmann et al. 2011a,b).

The genotyping assay experiment took place over a 14-month period and, to our knowledge, is the single largest genotyping experiment to date, producing >70 billion genotypes. The magnitude of the experiment, in conjunction with the long duration and simultaneous high throughput, required new protocols for assuring quality control (QC) during the assays and new genotyping strategies in postassay data analysis.

Samples were assayed at an average rate of 1600 per week over the course of the experiment. The sustained high throughput meant that it was critical to rapidly identify problems such as systematic handling errors, equipment failures, or deficiencies in consumables that can vary over time. It was thus crucial to create a mechanism to assess assay quality in as near real time as possible. It was also important to assess trends in performance over weekly and monthly intervals to detect deterioration in performance over time. One of the advantages of this study is that all samples were extracted and normalized in a single lab at KPNC, and all samples were assayed in a single lab at the University of California, San Francisco (UCSF). Robotic processing was used where possible to increase the consistency of the assay and to prevent mistakes.

Here we provide details of the DNA extraction process, the genotyping process, and the QC steps taken during the experiment. We also describe the postassay processing created to optimize genotype reproducibility across the cohort and the final result of genotyping in terms of numbers of samples and SNPs passing strict quality control criteria for each of the four arrays.

Materials and Methods

Study population

Participants in the project were all adult (≥18 years old) members of the KPNC who had consented to participate in the KPNC RPGEH. Participants provided a saliva sample and broadly consented to the use of their DNA, mailed survey response data, and linked electronic health records in health research. Over a 32-month period beginning in July 2008, the RPGEH collected 140,000 saliva samples prior to the conclusion of DNA extraction for the study, with the bulk collected between October of 2008 and November of 2009 (Supporting Information, Figure S1). The GERA cohort of 110,266 participants was formed by including samples from all racial and ethnic minority participants in the RPGEH enrolled through February 2011 (the conclusion of DNA extraction for the project), with the remainder of the cohort composed of samples from non-Hispanic white participants. The resulting cohort included 19.2% individuals who identified themselves as having African American, Asian, Hispanic/Latino, Native American, or Pacific Island race/ethnicity, with the remainder (80.8%) reporting themselves as non-Hispanic white. The cohort included 110,266 subjects to ensure that at least 100,000 individuals were successfully genotyped by the end of the project. The institutional review boards for human subjects research of both KPNC and UCSF approved the project.

Saliva sample collection, DNA extraction, and transfer of samples/data

Following completion of a mailed health survey and return by mail of a signed consent form, the RPGEH collected a saliva sample by mail from consented participants using an Oragene OG-250 disc format saliva kit, which is designed to be sent through the mail in a padded envelope. Each kit was identified with a unique barcode that was linked to the participant's unique study identification number. Along with the kit, participants received written instructions on providing a saliva specimen, following the protocol provided for completion of Oragene kits. A telephone number was provided for participants to call with questions. Participants were instructed to refrain from eating for 30 min prior to spitting in the kit and to rinse their mouths with water prior to providing the saliva sample. Each kit holds about 4 ml of saliva; a preservative in the cap of the kit is released when the participant screws on the lid of the kit after providing a saliva sample. According to documentation provided by the kit's manufacturer, the sealed kits would maintain DNA quality when stored at ambient temperature for a period of 5 years. Returned kits were logged in, weighed (to detect empty or low-volume samples), and stored unopened in a single layer in boxes in a storage room maintained at ambient room temperature.

DNA was extracted and normalized in a single laboratory of the KPNC Division of Research, beginning in November 2009 and concluding in March 2011, in a fully automated system with integrated tracking and data capture of all samples and aliquots. The average length of time a sample was stored prior to DNA extraction was 483.64 days (SD 123.90), ranging from a minimum of 10 days to a maximum of 916 days, with the majority stored for 11–16 months (Figure S2). Saliva samples from a total of 124,185 participants were used for the project. Weighing and visual inspection of opened saliva kits resulted in exclusion of 2435 (2%) from further processing due to low volume or particulate matter in the saliva. A sample of 0.5 ml of saliva was drawn and placed in two deep well blocks for DNA extraction, with the remaining saliva placed in cryovials and stored at −80°. Following incubation of saliva samples, Agencourt-DNAdvance SPRI paramagnetic bead technology kits were used to extract DNA from 121,750 samples. Extracted samples were quantified by PicoGreen (Invitrogen Quant-iT dsDNA assay kits) with a fluorescence intensity-top read via the DTX880 Multimode Detector (Beckman Coulter, Brea, CA). For most of the project, samples that were within an initial concentration range of 30–470 ng/µl were hit-picked and then normalized via a Span-8 Biomek FXP (Beckman Coulter) in 10 mM Tris, 0.1 mM EDTA, pH 8.0, to a concentration of 10 ng/µl for genotyping. To accommodate the timing of the design of the ancestry-specific microarrays used for genotyping (Hoffmann et al. 2011a,b), samples from self-reported non-Hispanic white participants were extracted first, normalized, and sent to the UCSF Genomics Core Facility for genotyping prior to the extraction and plating of DNA samples from self-reported minority participants. To maximize the inclusion of ethnic minority participants in the project, we reduced the lower end of the eligible initial concentration range to 15 from 30 ng/µl. Overall, 10.4% of extracted samples were excluded because the concentration fell outside the specified range. A subset of DNA samples was also checked for purity using a NanoDrop Technologies ND-1000 spectrophotometer. The A260/280 ratios of these samples were in the range 1.5–1.9, with most samples between 1.7 and 1.8. The average yield of DNA from each 0.5 ml of saliva used was 48.6 µg. A total of 110,266 samples of DNA were normalized into 96-well plates. The plates of normalized DNA linked a unique sample identification number to the well position and plate number of the DNA sample. The plates of DNA were transferred to the UCSF Genomics Core Facility along with a computer file that provided the link between the sample identification number and the sample well position and plate number. The laboratory results were linked to the same identification number; genotyping calls were also linked to the same identifier, and the file was returned to Kaiser Permanente for linkage to survey data and electronic medical records using a key that linked the original identifiers with the samples.

Axiom platform design

The genotyping experiment used custom-designed arrays based on the Affymetrix Axiom Genotyping Solution (Affymetrix White Paper). It is a two-color, ligation-based assay utilizing, on average, 30-mer oligonucleotide probes synthesized in situ on a microarray substrate, with automated parallel processing of 96 samples per plate and a total of 1.38 million features available for experimental content. The design of the four ethnic-specific arrays has been described previously (Hoffmann et al. 2011a,b). The original design of the Axiom arrays required probe sets for each SNP to be included at least twice for QC reasons. However, after initial experience with our first array for non-Hispanic whites (Hoffmann et al. 2011a), we expanded the number of SNPs on subsequent arrays by including some SNP probe sets only once, based on high-performance characteristics (Hoffmann et al. 2011b). The four arrays were designed to maximize coverage in non-Hispanic whites (EUR), East Asians (EAS), African Americans (AFR), and Latinos (LAT). The number of autosomal, X-linked, Y-linked, and mitochondrial SNPs on each of the four arrays is provided in Table 1. As detailed below, the presence of a higher proportion of probe sets tiled once on the AFR and LAT arrays affected their QC characteristics.

Assay protocol

At UCSF, a small set of successfully run plates was designated as a source of duplicate samples for later plates. On each plate, one of the wells was filled with a previously run sample (i.e., as a duplicate) to evaluate assay reproducibility as a QC measure going forward. The remaining 95 wells were occupied by new samples. These duplicate samples were the only ones run more than one time in the study. Most duplicate samples were run on the same ethnic array as the original sample. However, because all duplicates were from successfully previously run samples, for the first few plates for each array this was not possible. Hence, for the first few EAS, AFR, and LAT plates, the duplicate sample came from a sample previously run on the EUR array. For the first few EUR plates, the duplicate samples were from prior Affymetrix reference samples.

The samples were then processed using the standard Affymetrix Axiom sample prep protocol (Affymetrix Genotyping Protocol). A major change in protocol that occurred during the experiment, after all EUR and EAS plates were processed, was the implementation of a novel reagent kit from Affymetrix. The original reagent kit, designated Axiom 1.0, was primarily used for assaying the EUR and EAS arrays, while the updated reagent kit, referred to as Axiom 2.0, was used for assaying AFR, LAT, and later EUR and EAS arrays. The differences between Axiom 1.0 and Axiom 2.0 are at the level of specific reagents in module 1, such as cosolvents used in the isothermal whole genome amplification. The shift to Axiom 2.0 enabled the validation of more SNPs overall and increased genome-wide coverage, with a modest loss of previously validated SNPs on Axiom arrays. The reagent kit affected intensity cluster centers in the genotype clustering process and hence needed to be controlled for during genotyping, as described below.

Samples were generally randomized across Axiom plates within array type, but some nonrandom structure was necessary from pragmatic considerations. First, subjects on each array type were genotyped together, independently of the other array types. Second, the first plates assayed with the EUR arrays contained many of the oldest subjects (ages 85 and older), whose samples had been collected first to maximize their participation in the research. The first samples were then prioritized for DNA extraction and genotyping, resulting in a higher percentage of samples from elderly subjects in the first group of plates of non-Hispanic whites to be genotyped. Most plates contained a random distribution of females vs. males (according to the overall 58:42 ratio of females to males in the cohort). However, because the GERA cohort also included a subset of male subjects from the California Men's Health Study (Enger et al. 2006), who were processed during a distinct time period, some plates contained a majority of male subjects.


Once samples were prepared and embedded into Axiom assay plates, they were assayed on Axiom Gene Titans. The standard Affymetrix protocol was generally followed, with the exception of the hybridization time. The experiment began with the standard 24-hr hybridization interval, but it was soon found to be advantageous to hybridize for longer periods of 48–72 hr to maximize the signal-to-noise ratio. Both 48- and 72-hr hybridization times gave similar performance, and both were used to facilitate lab scheduling.

Genotype calling protocol

To rapidly identify system problems affecting genotype quality, a short-term automated quality control feedback cycle was created for the earliest possible detection and repair of major problems in the assay. The feedback cycle was based on a data-driven automatic analysis pipeline using Affymetrix Power Tools (APT). As soon as a Gene Titan completed its scanning phase, CEL files were transferred to a Linux cluster for analysis. There, all successfully assayed samples from a single plate were genotyped together and the results sent via e-mail to the project staff.

The genotype calling was based on the standard three-step process as described in the Affymetrix analysis guide (Figure 1) (Affymetrix Axiom Analysis Guides 2015). For a sample to pass quality control, it required a DishQC (DQC) score ≥0.82 and a first-pass sample call rate (CR1) ≥97%. DishQC is a measure of the contrast between the AT and GC signals assayed in nonpolymorphic test sequences (Affymetrix White Paper). It provides a type of signal-to-noise figure of merit that is well correlated with sample call rate, allowing for prediction of successful samples.
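The two sample-level thresholds can be written as a simple predicate. The following is a minimal Python sketch, not the project's APT-based pipeline: the function name, the constant names, and the example sample identifiers and values are hypothetical, and only the two cutoffs are taken from the text.

```python
# Cutoffs quoted in the text; everything else here is illustrative.
DQC_MIN = 0.82   # minimum DishQC score
CR1_MIN = 0.97   # minimum first-pass sample call rate (97%)


def sample_passes_qc(dqc: float, cr1: float) -> bool:
    """Return True if a sample meets both the DQC and CR1 thresholds."""
    return dqc >= DQC_MIN and cr1 >= CR1_MIN


# Hypothetical per-sample metrics: sample ID -> (DQC, CR1 as a fraction).
samples = {
    "S1": (0.95, 0.991),  # passes both thresholds
    "S2": (0.80, 0.985),  # fails on DQC
    "S3": (0.90, 0.960),  # fails on CR1
}

passed = [sid for sid, (dqc, cr1) in samples.items()
          if sample_passes_qc(dqc, cr1)]
```

Both comparisons are inclusive, matching the "≥" in the stated criteria.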

It was found advantageous to create custom Bayesian priors for genotyping intensity cluster centers and covariances rather than using the standard Affymetrix-supplied priors. To create custom Bayesian priors, samples from 8–12 high-performing plates distributed across time and assay conditions were genotyped together using weak generic priors. The resulting posterior distribution statistical parameters became the new custom priors. Custom priors proved superior for two reasons: saliva-based DNA tended to produce different cluster centers than the Affymetrix blood-based priors, and increasing the hybridization time also tended to shift cluster centers.

In addition to the genotyping results, montages of array digital images were used to visually identify errors in processing and sources of unexpected noise (Figure S3). These montages proved especially valuable in diagnosing problems in a prompt fashion.

While short-term monitoring of the genotype assay allowed discovery and correction of acute problems, it did not address changes over time. Therefore, to track gradually developing problems (such as with equipment and consumables), we created raster plots of the DQC and first-pass call rate (CR1) to monitor performance over time. Figure 2 shows the distribution of DQC and CR1 for each Axiom plate assayed in the experiment.

Reproducibility analysis

To detect changes in the assay over the course of the experiment, we analyzed discordance of duplicate samples run on different plates, as described above. The gap in plate number between the original and duplicate assay was varied to give a range of intervals to examine. Figure S4 shows the distribution of original vs. duplicate sample plate numbers. This distribution allowed for detection of changes over a course of months.

Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other, relative to the number of genotype pairs for which both genotypes are called. Discordance can be categorized as a single-allele difference (one sample call being a homozygote and the other a heterozygote) or as a two-allele difference (the two samples being called opposite homozygotes). The large majority of discordances involve single-allele differences.

Discordant genotype pairs necessarily contain at least one minor allele; hence a genotype discordance rate for a SNP is limited by minor allele frequency (MAF). For rare SNPs, it is typically the minor allele genotypes that are important, and hence the discordance rate among the heterozygous and minor allele homozygous genotypes becomes of interest, as opposed to the majority of genotype pairs, which are major allele homozygotes and concordant. To address discordance in these cases, we define a minor allele pair (MAP) as a genotype pair, both genotypes called, in which at least one genotype includes a minor allele. Then a useful measure of discordance among such pairs is the genotype discordance rate per MAP:

$$\text{discordance per MAP} = \frac{\text{discordant pairs}}{\text{minor allele pairs}}$$
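The definitions above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each genotype is coded as the number of minor alleles (0, 1, or 2) with `None` marking a no-call; the function name and the example genotype vectors are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional, Sequence


def discordance_summary(original: Sequence[Optional[int]],
                        duplicate: Sequence[Optional[int]]) -> Dict[str, float]:
    """Summarize discordance between two assays of the same sample.

    Genotypes are minor-allele counts (0, 1, 2); None means no call.
    """
    # Keep only SNPs where both members of the duplicate pair were called.
    called = [(a, b) for a, b in zip(original, duplicate)
              if a is not None and b is not None]
    discordant = sum(1 for a, b in called if a != b)
    # Homozygote vs. heterozygote: one allele differs.
    single_allele = sum(1 for a, b in called if abs(a - b) == 1)
    # Opposite homozygotes: two alleles differ.
    two_allele = sum(1 for a, b in called if abs(a - b) == 2)
    # Minor allele pairs (MAPs): at least one genotype carries a minor allele.
    minor_allele_pairs = sum(1 for a, b in called if a > 0 or b > 0)
    return {
        "called_pairs": len(called),
        "discordant": discordant,
        "single_allele": single_allele,
        "two_allele": two_allele,
        "discordance": discordant / len(called) if called else float("nan"),
        "discordance_per_map": (discordant / minor_allele_pairs
                                if minor_allele_pairs else float("nan")),
    }


# Hypothetical genotypes for eight SNPs in an original and a duplicate assay.
original = [0, 0, 1, 2, None, 0, 1, 0]
duplicate = [0, 1, 1, 2, 0, 0, None, 0]
summary = discordance_summary(original, duplicate)
```

In this toy example six pairs are called on both samples, one of them is discordant, and three are MAPs, so the per-MAP rate is twice the overall discordance rate. This is the effect the per-MAP measure is meant to expose for low-frequency SNPs.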

Genotype-calling package design

At various points along the duplicate plate number axis, reproducibility analysis showed sudden changes in genotype concordance. By conducting a change-point analysis and correlating the changes with known experimental events, factors affecting the probe intensity cluster centers were discovered. These factors, along with known prior factors such as array type, formed the categorical dimensions we used to partition the samples for final genotype calling of the cohort. The factors were chosen to maximize within-package homogeneity and hence optimize the empirical genotype concordance across duplicate samples in different parts of the partition. The factors driving this partition included array type, hybridization time, reagent kit type, reagent lot, and initial (low) DNA concentration. Low-concentration samples included those with initial concentrations between 15 and 30 ng/µl; those >30 ng/µl were considered to be in the normal range.

Table 1 Number and type of SNPs assayed and passing QC by array

Array  SNPs assayed  SNPs tiled once  Autosomal  X-linked  Y-chrom  mtDNA  Passing QC  % of SNPs passing
EUR    674,518       0                660,990    13,123    289      116    670,572     99.42
EAS    712,950       65,473           699,324    13,385    158      83     708,373     99.36
AFR    893,631       429,451          867,035    26,264    234      98     878,176     98.27
LAT    817,810       282,901          792,056    25,397    234      123    802,186     98.09

On the basis of these factors, the samples were partitioned into "packages" consisting of samples from 2–26 plates to achieve homogeneity of conditions. Each package then underwent genotype calling separately.
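The partitioning idea can be sketched as grouping on a tuple of categorical factors. The Python below is a simplified illustration under stated assumptions: it treats every factor as a plate-level attribute (in the study the partition was over samples, and DNA concentration is a sample-level property), it does not enforce the 2–26 plate package size, and the record type, field names, plate identifiers, and lot labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class PlateContext(NamedTuple):
    """Hypothetical record of the assay context for one plate."""
    plate_id: str
    array: str               # "EUR", "EAS", "AFR", or "LAT"
    hyb_hours: int           # hybridization time in hours
    reagent_kit: str         # e.g. "Axiom 1.0" or "Axiom 2.0"
    reagent_lot: str
    low_concentration: bool  # True for low initial DNA concentration


def partition_into_packages(
        plates: List[PlateContext]) -> Dict[Tuple, List[str]]:
    """Group plate IDs so that each package shares one assay context."""
    packages: Dict[Tuple, List[str]] = defaultdict(list)
    for plate in plates:
        key = plate[1:]  # every field except plate_id defines the context
        packages[key].append(plate.plate_id)
    return dict(packages)


plates = [
    PlateContext("P001", "EUR", 24, "Axiom 1.0", "lotA", False),
    PlateContext("P002", "EUR", 24, "Axiom 1.0", "lotA", False),
    PlateContext("P003", "EUR", 48, "Axiom 1.0", "lotA", False),
    PlateContext("P004", "LAT", 48, "Axiom 2.0", "lotB", True),
]
packages = partition_into_packages(plates)
```

Because the key is the full tuple of factors, two plates land in the same package only when they agree on every factor, which is what makes the packages disjoint and internally homogeneous.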

Genotype postprocessing

After the package-based genotypes were created, genotyping quality for each probe set was assessed. Genotypes for poorly performing probe sets in each package were filtered out of the final data set; this was called per-package filtering. Probe set performance was also assessed across all packages for a particular array type, and poorly performing probe sets across packages were filtered out of all packages for that array; this was termed per-array filtering. In filtering genotypes and probe sets within and across packages, we employed liberal call rate thresholds to retain the maximum amount of potentially useful data. Therefore, depending on the particular use, further genotype filtering may be appropriate.

The filtering pipeline included five steps, as follows:

1. Per-package filtering of probe sets with a call rate <90%.

2. Filtering of probe sets with large allele frequency variation across packages within an array. Specifically, if $p_i$ is the allele frequency for package $i$ and $p'$ is the average allele frequency across packages, then

$$\text{variance ratio} = \frac{p'(1 - p')}{\mathrm{Var}(p_i)}$$

A large variance ratio indicates a stable allele frequency across packages. Probe sets with variance ratio <31 for a given array were filtered out.

3. Filtering of autosomal probe sets based on a large allele frequency difference (>0.15) between males and females.

4. Filtering of probe sets with poor overall performance. Specifically, the per-array genotyping rate for each probe set was calculated as the total number of genotypes across all packages for that array that are called vs. attempted, and those with overall call rate <0.60 were filtered out.

5. Filtering of probe sets with poor genotype concordance across duplicates. Specifically, for each probe set and array type, the number of discordant genotypes across all pairs of duplicate samples in which both samples were assayed on the same array type was calculated. Then, probe sets for which the genotype discordance count exceeded array-dependent thresholds (208 discordant of 851 possible for EUR, 23 discordant of 61 possible for EAS, 8 discordant of 12 possible for AFR, 26 discordant of 71 possible for LAT) were filtered out.
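Steps 2 and 4 of the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the text does not say whether Var(p_i) is a population or a sample variance (the population variance is used here), and the function names and example frequencies are hypothetical. Only the thresholds of 31 and 0.60 come from the text.

```python
from statistics import mean, pvariance
from typing import Sequence

VARIANCE_RATIO_MIN = 31.0        # step 2 threshold from the text
PER_ARRAY_CALL_RATE_MIN = 0.60   # step 4 threshold from the text


def variance_ratio(package_freqs: Sequence[float]) -> float:
    """Binomial-scale variance p'(1 - p') over the across-package variance."""
    p_avg = mean(package_freqs)
    var = pvariance(package_freqs)  # assumption: population variance
    if var == 0:
        return float("inf")  # identical frequency in every package
    return p_avg * (1 - p_avg) / var


def passes_per_array_filters(package_freqs: Sequence[float],
                             called: int, attempted: int) -> bool:
    """Apply the variance-ratio and overall call-rate filters to a probe set."""
    overall_call_rate = called / attempted
    return (variance_ratio(package_freqs) >= VARIANCE_RATIO_MIN
            and overall_call_rate >= PER_ARRAY_CALL_RATE_MIN)


# Hypothetical allele frequencies for one probe set in four packages.
stable = [0.30, 0.31, 0.29, 0.30]   # consistent clustering across packages
unstable = [0.0, 1.0, 0.0, 1.0]     # cluster called AA in some, BB in others
```

The `unstable` example mimics the failure mode described in the text for low-intensity probe sets, where a single cluster is labeled AA in some packages and BB in others; its variance ratio falls far below the threshold, whereas the `stable` example lies well above it.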

The thresholds described in each step above are to some degree arbitrary but were chosen to be conservative in terms of the number of SNPs removed. The intention was to remove only SNPs with a very high probability of inaccuracy; this means that some poorly performing SNPs likely remain, but these can be filtered out at later stages by end users.

The threshold of 90% for per-package filtering was chosen through inspection of a statistical sample of SNP intensity plots at various SNP call rates. Those SNPs with a call rate <90% invariably had problems with cluster splits, highly overlapping clusters, or pathological cluster structure (such as a large number of off-target variant calls) that rendered the genotypes uncertain. Above 90%, SNPs with usable genotypes could be found, with the percentage of usable SNPs increasing with increasing call rate.

Figure 1 Standard Affymetrix Axiom Genotyping analysis workflow.

Figure 2 First-pass sample call rate (CR1) vs. DQC. Black points are samples assayed with Axiom 1.0 and red points are those with Axiom 2.0. Threshold for a sample to pass is call rate ≥97%.

The threshold allele frequency variance ratio of 31 wasalso chosen empirically Figure S5 shows a scatterplot of thenatural log variance ratio vs the minor allele frequency forSNPs on the EUR array One observes an approximate par-tition of SNPs to regions above and below the 31 thresholdvalue Inspection of intensity plots for SNPs 31 thresholdvalue showed them to invariably have low signal intensity inboth the A and B channels leading to a single cluster nearthe A = B axis The APT algorithm would sometimes call thisas an AA cluster sometimes as BB The resulting wide allelefrequency fluctuations produced a low variance ratio Above31 there were cases of SNPs with a large fraction of goodgenotypes with some packages showing cluster splits

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥0.15 is extremely small (<10⁻¹⁰). Thus, even though we are examining on the order of 10⁷–10⁸ SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
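The order of magnitude of this probability can be checked with a normal approximation. The sketch below is not the authors' calculation: it assumes an autosomal SNP (two alleles per individual), an equal split of the 1000 individuals between the sexes, and the worst-case allele frequency of 0.5; the function name and its parameters are illustrative.

```python
import math


def prob_sex_frequency_gap(n_individuals=1000, allele_freq=0.5,
                           gap=0.15, female_fraction=0.5):
    """Two-sided probability that male and female allele frequencies differ
    by at least `gap` by chance, under a normal approximation."""
    n_female = n_individuals * female_fraction
    n_male = n_individuals - n_female
    # Assumption: autosomal SNP, so each individual contributes two alleles.
    variance = allele_freq * (1 - allele_freq) * (
        1 / (2 * n_female) + 1 / (2 * n_male))
    z = gap / math.sqrt(variance)
    # erfc(z / sqrt(2)) is the two-sided tail of the standard normal.
    return math.erfc(z / math.sqrt(2))


p = prob_sex_frequency_gap()
expected_false_hits = 1e8 * p  # upper end of the 10^7-10^8 range in the text
```

Under these assumptions the probability falls below 10⁻¹⁰, so the expected number of chance exceedances stays well below one even at 10⁸ SNP-package combinations.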

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two sigma confidence bounds on the estimate.
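The reasoning above can be written out as a small worked example. This is a sketch only: the binomial standard error is an assumed form for the "two sigma" bounds, the function names are illustrative, and the code is not claimed to reproduce the exact derivation of the published thresholds.

```python
import math


def expected_discordance(error_rate):
    """Approximate discordance between duplicates when either sample can
    carry an error (the approximation stated in the text)."""
    return 2 * error_rate


def two_sigma_interval(discordant, total):
    """Two-sigma bounds on a discordance rate estimated from counts,
    assuming a binomial standard error (an assumption of this sketch)."""
    p_hat = discordant / total
    sigma = math.sqrt(p_hat * (1 - p_hat) / total)
    return max(0.0, p_hat - 2 * sigma), min(1.0, p_hat + 2 * sigma)


target = expected_discordance(0.10)         # 20% discordance
lower, upper = two_sigma_interval(208, 851)  # EUR counts from step 5
```

With the EUR counts, the lower two-sigma bound lies above the 20% target, which is consistent with a conservative cutoff; smaller arrays such as AFR (12 possible pairs) give much wider intervals.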

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results: one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best performing probe set in terms of call rate was chosen to represent the SNP and the other was filtered out.

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean DNA concentration ranged from a low of 115.16 (SD 75.09) in the month of March to a high of 147.04 (SD 87.66) in the month of July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, df = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in the month of December (99.34, SD 0.50) and the highest in the month of June (99.50, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, df = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, df = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, df = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of both DNA concentration and CR1 to storage time appeared to be nonlinear, however, in that the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, df = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but less significant correlation between subject age and CR1 (r = 0.033, F = 112.36, df = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. While Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, there is a stronger positive correlation between DQC and CR1 (by linear regression, adjusted R² = 0.468, P < 0.0001) with Axiom 2.0 than with Axiom 1.0 (by linear regression, adjusted R² = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.
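The total passing rates quoted above are consistent with multiplying the two stage-wise passing rates. The sketch below assumes that the CR1 passing rate is measured among samples that already passed DQC; that reading is an assumption, and the function name is illustrative.

```python
def total_pass_rate(dqc_pass_rate, cr1_pass_rate):
    """Overall pass rate when CR1 is assessed only on DQC-passing samples
    (an assumed interpretation of the two reported rates)."""
    return dqc_pass_rate * cr1_pass_rate


axiom_1 = total_pass_rate(0.951, 0.986)  # EUR arrays, Axiom 1.0
axiom_2 = total_pass_rate(0.980, 0.978)  # EUR arrays, Axiom 2.0
```

Rounded to three decimals, the two products match the reported totals of 93.8% and 95.8%.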

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 16266, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely, lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75, mean for Axiom 1.0 = 99.43, t-test = 27374, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the Gene Titans. This behavior also demonstrates why real-time QC was critical to maintaining a high level of performance.

1056 M. N. Kvale et al.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 1062, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above) both for the original plate-based genotype calling (CR1) and package-based genotype calling for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21816, P < 0.0001).
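Genotype discordance for one duplicate pair, as defined in the legend of Figure 5 (the proportion of genotype pairs called on both samples that differ), can be sketched as follows. The list-based input format, the use of `None` for a no-call, and the function name are assumptions made for illustration.

```python
def genotype_discordance(calls_a, calls_b):
    """Proportion of genotypes called in both duplicates that differ.

    calls_a, calls_b: equal-length sequences of calls for the same probe
    sets in the same order, with None marking a no-call (assumed layout).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("duplicates must cover the same probe sets")
    compared = 0
    differing = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # pairs with a no-call on either side are not counted
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no genotype was called in both duplicates")
    return differing / compared
```

Both single-allele differences (for example AA vs. AB) and two-allele differences (AA vs. BB) count as one discordant genotype in this sketch.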

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF < 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotype calling using custom priors vs. plate-based genotype calling using standard Affymetrix priors for a subset of LAT-2.0 plates.


It is also seen in Figure 6 that for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance and thus improve reproducibility of genotypes of SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann–Whitney U-tests (for MAF above 0.10, plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for MAF between 0.05 and 0.10, plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for MAF between 0.01 and 0.05, plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF below 0.01, plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 93.84% overall, with slightly higher rates for individuals run on the AFR array (95.45%) and EAS array (95.08%) compared to those run on the EUR (93.85%) and LAT (92.08%) arrays (Table 2). In terms of SNP results (Table 1), the arrays with a substantial proportion of single-tiled SNPs had slightly lower overall SNP success rates (98.27% for AFR and 98.09% for LAT) than those with most or all SNPs tiled twice (99.36% for EAS and 99.42% for EUR).
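The sample success rates can be recomputed directly from the assayed and passing counts in Table 2. The short sketch below uses those counts; the variable and function names are illustrative.

```python
# (assayed, passing QC) counts per array, taken from Table 2.
TABLE_2 = {
    "EUR": (84430, 79236),
    "EAS": (8043, 7647),
    "AFR": (5779, 5516),
    "LAT": (11585, 10668),
}


def success_rates(table):
    """Percentage of assayed samples passing QC, per array and overall."""
    rates = {array: 100 * passed / assayed
             for array, (assayed, passed) in table.items()}
    total_assayed = sum(assayed for assayed, _ in table.values())
    total_passed = sum(passed for _, passed in table.values())
    rates["Total"] = 100 * total_passed / total_assayed
    return rates


rates = success_rates(TABLE_2)
```

The per-array counts sum to the 109,837 assayed and 103,067 passing samples reported for the whole cohort.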

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs for each of the four arrays is provided in Figure 7. All arrays show an approximately uniform frequency for MAF > 0.10 but a clear excess of SNPs with MAF < 0.10. The AFR and LAT arrays have the highest proportion of low-MAF SNPs and the EUR array the lowest. The reason for this difference lies in the design of the arrays (Hoffmann et al. 2011a,b). Because individuals with mixed genetic ancestry were analyzed on the LAT, AFR, and EAS arrays, coverage of low-frequency variants in more than one ancestral group was attempted for these arrays, as opposed to the EUR array, which was based solely on European ancestry.

Discussion

Using standard QC criteria in this saliva-based DNA genotyping experiment, we achieved a high overall success rate both for SNPs and samples (103,067 individuals successfully genotyped of 109,837 assayed). Still, that left 6770 “failed” samples that would not be usable for GWAS or other analyses. When a sample fails the sample call rate criterion of CR1 ≥ 97%, it does not mean that all genotypes in the sample are inherently unreliable. Typically, some probe sets cluster poorly and others cluster well, even for substandard samples. One strategy would be to remove the poorest performing probe sets to increase the number of passing samples. However, this approach creates a tradeoff between eliminating poor SNPs and recovering previously failing samples: more samples are accepted, but at the expense of accepting fewer SNPs.

While the DQC contrast measure is generally a good predictor of CR1 and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, at the same time, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.
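The effect of relaxing the DQC threshold can be expressed as a simple two-criterion check. In this sketch the thresholds (DQC 0.82 or 0.75, call rate 97%) come from the text, while the inclusive comparisons, the tuple layout of a sample, and the function names are assumptions.

```python
def sample_passes(dqc, call_rate, dqc_threshold=0.82, call_rate_threshold=97.0):
    """A sample passes when both DQC and first-pass call rate (in percent)
    meet their thresholds; inclusive comparison is assumed here."""
    return dqc >= dqc_threshold and call_rate >= call_rate_threshold


def rescued_by_relaxing(samples, relaxed_dqc=0.75):
    """Samples (dqc, call_rate) that fail the default criteria but pass
    once the DQC threshold is lowered to `relaxed_dqc`."""
    return [s for s in samples
            if not sample_passes(*s)
            and sample_passes(*s, dqc_threshold=relaxed_dqc)]


samples = [(0.80, 98.5), (0.90, 99.0), (0.78, 96.0), (0.70, 99.0)]
rescued = rescued_by_relaxing(samples)
```

Only samples that already meet the call rate criterion can be rescued this way, and, as noted above, such samples may carry more genotyping errors.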

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to effect this cure, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster split probe sets from those that have no cluster splits.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Difference can include single allele differences (more common) or two allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP grouped by MAF for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes and the dashed curves show discordance for plate-generated genotypes.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is provided by recalling the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hulls for each called cluster and to test if no-calls are contained in any of the hulls; if they are, then they are assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so that the converted no-calls would always be calls in the absence of artifact processing.
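The convex hull alternative can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the study: samples are taken to be points in a two-dimensional intensity space, hulls are built with Andrew's monotone chain algorithm, and a no-call lying inside more than one hull is left as a no-call; all function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if the point lies inside or on a counterclockwise convex hull."""
    if len(hull) < 3:
        return False  # degenerate clusters cannot contain a no-call
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


def rescue_no_calls(called, no_calls):
    """called: dict genotype -> list of (x, y) tuples for called samples.
    no_calls: list of (x, y) tuples. Returns one genotype or None per no-call."""
    hulls = {genotype: convex_hull(pts) for genotype, pts in called.items()}
    rescued = []
    for p in no_calls:
        hits = [g for g, hull in hulls.items() if inside_hull(p, hull)]
        # Assumption: ambiguous points (inside several hulls) stay no-calls.
        rescued.append(hits[0] if len(hits) == 1 else None)
    return rescued
```

Because the hull never extends beyond the called samples of a cluster, a point outside every hull is left untouched, which reflects the conservative nature of the approach described above.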

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of “montages” during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the “plate QC” tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for “sample QC” purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health. The title of the dbGaP Parent study is: The Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH), GO Project IRB: CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollments, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Table 2 Number of samples assayed and passing QC by array

Array   Assayed   Passing QC   % of all passed samples   % of samples passing
EUR      84,430       79,236                     76.88                  93.85
EAS       8,043        7,647                      7.42                  95.08
AFR       5,779        5,516                      5.35                  95.45
LAT      11,585       10,668                     10.35                  92.08
Total   109,837      103,067                    100.0                   93.84

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Note added in proof: See Lapham et al. 2015 (pp. 1061–1072) and Banda et al. 2015 (pp. 1285–1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray


GENETICS
Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale, Stephanie Hesselson, Thomas J. Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A. Croen, Brad P. Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K. Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P. Quesenberry Jr., Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A. Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178905

SI M. Kvale et al. 2

Figure S1 Distribution of GERA saliva sample collection dates by month and year

[Figure S1 plot: y-axis, Number (0 to 16,000); x-axis, Month and Year (Jul-08 to Jan-11)]


Figure S2 Distribution of saliva sample storage times prior to processing in months

[Figure S2 plot: y-axis, Number (0 to 35,000); x-axis, Months (1 to 31)]


Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.


Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.


Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.


Figure S6 Mean DNA concentration by month of the year

[Figure S6 plot: y-axis, Mean DNA Concentration (0 to 160); x-axis, Month (Jan to Dec)]


Figure S7 Mean initial call rate (CR1) by month of the year

[Figure S7 plot: y-axis, Mean Call Rate (99.25 to 99.55); x-axis, Month (Jan to Dec)]


Figure S8 Mean DNA concentration by saliva sample storage time in 100 day intervals

[Figure S8 plot: y-axis, Mean DNA Concentration (115 to 145); x-axis, Days of Storage (<250, 250–350, 350–450, 450–550, 550–650, 650–750, 750–850)]


Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100 day intervals

[Figure S9 plot: y-axis, Mean Call Rate (98.9 to 99.6); x-axis, Days of Storage (<250, 250–350, 350–450, 450–550, 550–650, 650–750, 750–850)]


Figure S10 Mean DNA concentration by age of study subject

[Figure S10 plot: y-axis, Mean DNA Concentration (0 to 180); x-axis, Age]


Figure S11 Mean initial call rate (CR1) by age of study subject

[Figure S11 plot: y-axis, Mean Call Rate (99.3 to 99.55); x-axis, Age]


Figure S12 Mean initial call rate (CR1) by DNA concentration

[Figure S12 plot: y-axis, Mean Call Rate (99.39 to 99.46); x-axis, DNA Concentration]


Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.


Figure S14 The same probe set as in Figure S13, but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


Asian and Latino or mixed) and the remaining 80% non-Hispanic white. For this project, four ethnic-specific arrays were designed based on the Affymetrix Axiom Genotyping System (Hoffmann et al. 2011a,b).

The genotyping assay experiment took place over a 14-month period and, to our knowledge, is the single largest genotyping experiment to date, producing >70 billion genotypes. The magnitude of the experiment, in conjunction with the long duration and simultaneous high throughput, required new protocols for assuring quality control (QC) during the assays and new genotyping strategies in postassay data analysis.

Samples were assayed at an average rate of 1600 per week over the course of the experiment. The sustained high throughput meant that it was critical to rapidly identify problems such as systematic handling errors, equipment failures, or deficiencies in consumables that can vary over time. It was thus crucial to create a mechanism to assess assay quality in as near real time as possible. It was also important to assess trends in performance over weekly and monthly intervals to detect deterioration in performance over time. One of the advantages of this study is that all samples were extracted and normalized in a single lab at KPNC and all samples were assayed in a single lab at the University of California, San Francisco (UCSF). Robotic processing was used where possible to increase the consistency of the assay and to prevent mistakes.

Here we provide details of the DNA extraction process, the genotyping process, and the QC steps taken during the experiment. We also describe the postassay processing created to optimize genotype reproducibility across the cohort and the final result of genotyping in terms of numbers of samples and SNPs passing strict quality control criteria for each of the four arrays.

Materials and Methods

Study population

Participants in the project were all adult (≥18 years old) members of the KPNC who had consented to participate in the KPNC RPGEH. Participants provided a saliva sample and broadly consented to the use of their DNA, mailed survey response data, and linked electronic health records in health research. Over a 32-month period beginning in July 2008, the RPGEH collected 140,000 saliva samples prior to the conclusion of DNA extraction for the study, with the bulk collected between October of 2008 and November of 2009 (Supporting Information, Figure S1). The GERA cohort of 110,266 participants was formed by including samples from all racial and ethnic minority participants in the RPGEH enrolled through February 2011 (the conclusion of DNA extraction for the project), with the remainder of the cohort composed of samples from non-Hispanic white participants. The resulting cohort included 19.2% individuals who identified themselves as having African American, Asian, Hispanic/Latino, Native American, or Pacific Island race/ethnicity, with the remainder (80.8%) reporting themselves as non-Hispanic white. The cohort included 110,266 subjects to ensure that at least 100,000 individuals were successfully genotyped by the end of the project. The institutional review boards for human subjects research of both KPNC and UCSF approved the project.

Saliva sample collection, DNA extraction, and transfer of samples/data

Following completion of a mailed health survey and return by mail of a signed consent form, the RPGEH collected a saliva sample by mail from consented participants using an Oragene OG-250 disc format saliva kit, which is designed to be sent through the mail in a padded envelope. Each kit was identified with a unique barcode that was linked to the participant's unique study identification number. Along with the kit, participants received written instructions on providing a saliva specimen, following the protocol provided for completion of Oragene kits. A telephone number was provided for participants to call with questions. Participants were instructed to refrain from eating for 30 min prior to spitting in the kit and to rinse their mouths with water prior to providing the saliva sample. Each kit holds about 4 ml of saliva; a preservative in the cap of the kit is released when the participant screws on the lid of the kit after providing a saliva sample. According to documentation provided by the kit's manufacturer, the sealed kits would maintain DNA quality when stored at ambient temperature for a period of 5 years. Returned kits were logged in, weighed (to detect empty or low volume samples), and stored unopened in a single layer in boxes in a storage room maintained at ambient room temperature.

DNA was extracted and normalized in a single laboratory of the KPNC Division of Research, beginning in November 2009 and concluding in March 2011, in a fully automated system with integrated tracking and data capture of all samples and aliquots. The average length of time a sample was stored prior to DNA extraction was 483.64 days (SD 123.90), ranging from a minimum of 10 days to a maximum of 916 days, with the majority stored for 11–16 months (Figure S2). Saliva samples from a total of 124,185 participants were used for the project. Weighing and visual inspection of opened saliva kits resulted in exclusion of 2435 (2%) from further processing due to low volume or particulate matter in the saliva. A sample of 0.5 ml of saliva was drawn and placed in two deep well blocks for DNA extraction, with the remaining saliva placed in cryovials and stored at −80°C. Following incubation of saliva samples, Agencourt-DNAdvance SPRI paramagnetic bead technology kits were used to extract DNA from 121,750 samples. Extracted samples were quantified by PicoGreen (Invitrogen Quant-iT dsDNA assay kits) with a fluorescence intensity top read via the DTX880 Multimode Detector (Beckman Coulter, Brea, CA). For most of the project, samples that were within an initial concentration range of 30–470 ng/µl were hit-picked and then normalized via a Span-8 Biomek FXP (Beckman Coulter) in 10 mM Tris, 0.1 mM EDTA, pH 8.0, to a concentration of 10 ng/µl for genotyping. To accommodate the timing of the design of the ancestry-specific microarrays used for genotyping (Hoffmann et al. 2011a,b), samples from self-reported non-Hispanic white participants were extracted first, normalized, and sent to the UCSF Genomics Core Facility for genotyping prior to the extraction and plating of DNA samples from self-reported minority participants. To maximize the inclusion of ethnic minority participants in the project, we reduced the lower end of the eligible initial concentration range to 15 from 30 ng/µl. Overall, 10.4% of extracted samples were excluded because the concentration fell outside the specified range. A subset of DNA samples was also checked for purity using a NanoDrop Technologies ND-1000 spectrophotometer. The A260/280 ratios of these samples were in the range 1.5–1.9, with most samples between 1.7 and 1.8. The average yield of DNA from each 0.5 ml of saliva used was 48.6 µg. A total of 110,266 samples of DNA were normalized into 96-well plates. The plates of normalized DNA linked a unique sample identification number to the well position and plate number of the DNA sample. The plates of DNA were transferred to the UCSF Genomics Core Facility along with a computer file that provided the link between the sample identification number and the sample well position and plate number. The laboratory results were linked to the same identification number; genotyping calls were also linked to the same identifier, and the file was returned to Kaiser Permanente for linkage to survey data and electronic medical records using a key that linked the original identifiers with the samples.

Axiom platform design

The genotyping experiment used custom designed arrays based on the Affymetrix Axiom Genotyping Solution (Affymetrix White Paper 2009). It is a two-color, ligation-based assay utilizing on average 30-mer oligonucleotide probes synthesized in situ on a microarray substrate, with automated parallel processing of 96 samples per plate, with a total of 1.38 million features available for experimental content. The design of the four ethnic-specific arrays has been described previously (Hoffmann et al. 2011a,b). The original design of the Axiom arrays required probe sets for each SNP to be included at least twice for QC reasons. However, after initial experience with our first array for non-Hispanic whites (Hoffmann et al. 2011a), we expanded the number of SNPs on subsequent arrays by including some SNP probe sets only once, based on high-performance characteristics (Hoffmann et al. 2011b). The four arrays were designed to maximize coverage in non-Hispanic whites (EUR), East Asians (EAS), African Americans (AFR), and Latinos (LAT). The number of autosomal, X-linked, Y-linked, and mitochondrial SNPs on each of the four arrays is provided in Table 1. As detailed below, the presence of a higher proportion of probe sets tiled once on the AFR and LAT arrays affected their QC characteristics.

Assay protocol

At UCSF, a small set of successfully run plates was designated as a source of duplicate samples for later plates. On each plate, one of the wells was filled with a previously run sample (i.e., as a duplicate) to evaluate assay reproducibility as a QC measure going forward. The remaining 95 wells were occupied by new samples. These duplicate samples were the only ones run more than one time in the study. Most duplicate samples were run on the same ethnic array as the original sample. However, because all duplicates were from successfully previously run samples, for the first few plates for each array this was not possible. Hence, for the first few EAS, AFR, and LAT plates, the duplicate sample came from a sample previously run on the EUR array. For the first few EUR plates, the duplicate samples were from prior Affymetrix reference samples.

The samples were then processed using the standard Affymetrix Axiom sample prep protocol (Affymetrix Genotyping Protocol). A major change in protocol that occurred during the experiment, after all EUR and EAS plates were processed, was the implementation of a novel reagent kit from Affymetrix. The original reagent kit, designated Axiom 1.0, was primarily used for assaying the EUR and EAS arrays, while the updated reagent kit, referred to as Axiom 2.0, was used for assaying AFR, LAT, and later EUR and EAS arrays. The differences between Axiom 1.0 and Axiom 2.0 are at the level of specific reagents in module 1, such as cosolvents used in the isothermal whole-genome amplification. The shift to Axiom 2.0 enabled the validation of more SNPs overall and increased genome-wide coverage, with a modest loss of previously validated SNPs on Axiom arrays. The reagent kit affected intensity cluster centers in the genotype clustering process and hence needed to be controlled for during genotyping, as described below.

Samples were generally randomized across Axiom plates within array type, but some nonrandom structure was necessary from pragmatic considerations. First, subjects on each array type were genotyped together, independently of the other array types. Second, the first plates assayed with the EUR arrays contained many of the oldest subjects (ages 85 and older), whose samples had been collected first to maximize their participation in the research. The first samples were then prioritized for DNA extraction and genotyping, resulting in a higher percentage of samples from elderly subjects in the first group of plates of non-Hispanic whites to be genotyped. Most plates contained a random distribution of females vs. males (according to the overall 58:42 ratio of females to males in the cohort). However, because the GERA cohort also included a subset of male subjects from the California Men's Health Study (Enger et al. 2006), who were processed during a distinct time period, some plates contained a majority of male subjects.

Once samples were prepared and embedded into Axiom assay plates, they were assayed on Axiom Gene Titans. The standard Affymetrix protocol was generally followed, with the exception of the hybridization time. The experiment began with the standard 24-hr hybridization interval, but it was soon found to be advantageous to hybridize for longer periods of 48–72 hr to maximize the signal-to-noise ratio. Both 48- and 72-hr hybridization times gave similar performance, and both were used to facilitate lab scheduling.

Genotype calling protocol

To rapidly identify system problems affecting genotype quality, a short-term automated quality control feedback cycle was created for the earliest possible detection and repair of major problems in the assay. The feedback cycle was based on a data-driven automatic analysis pipeline using Affymetrix Power Tools (APT). As soon as a Gene Titan completed its scanning phase, CEL files were transferred to a Linux cluster for analysis. There, all successfully assayed samples from a single plate were genotyped together and the results sent via e-mail to the project staff.

The genotype calling was based on the standard three-step process as described in the Affymetrix analysis guide (Figure 1) (Affymetrix Axiom Analysis Guides 2015). For a sample to pass quality control, it required a DishQC (DQC) score ≥0.82 and a first-pass sample call rate (CR1) ≥97%. DishQC is a measure of the contrast between the AT and GC signals assayed in nonpolymorphic test sequences (Affymetrix White Paper). It provides a type of signal-to-noise figure of merit that is well correlated with sample call rate, allowing for prediction of successful samples.
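The two-threshold sample QC rule above can be expressed as a small predicate. This is only an illustrative sketch: the function name and argument layout are assumptions made here, not part of APT or the authors' pipeline, and CR1 is passed as a fraction rather than a percentage.

```python
# Thresholds quoted in the text: DQC >= 0.82 and first-pass call rate >= 97%.
DQC_MIN = 0.82
CR1_MIN = 0.97


def sample_passes_qc(dqc, cr1, dqc_min=DQC_MIN, cr1_min=CR1_MIN):
    """Return True when a sample meets both QC thresholds.

    dqc: DishQC score for the sample (0 to 1).
    cr1: first-pass sample call rate as a fraction (0 to 1).
    The function name and signature are hypothetical.
    """
    return dqc >= dqc_min and cr1 >= cr1_min
```

Keeping the thresholds as parameters makes it easy to explore how many samples would pass under a different cutoff without changing the rule itself.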

It was found advantageous to create custom Bayesian priors for genotyping intensity cluster centers and covariances, rather than using the standard Affymetrix-supplied priors. To create the custom priors, samples from 8–12 high-performing plates distributed across time and assay conditions were genotyped together using weak generic priors. The resulting posterior distribution statistical parameters became the new custom priors. Custom priors proved superior for two reasons: saliva-based DNA tended to produce different cluster centers than the Affymetrix blood-based priors, and increasing the hybridization time also tended to shift cluster centers.

In addition to the genotyping results, montages of array digital images were used to visually identify errors in processing and sources of unexpected noise (Figure S3). These montages proved especially valuable in diagnosing problems in a prompt fashion.

While short-term monitoring of the genotype assay allowed discovery and correction of acute problems, it did not address changes over time. Therefore, to track gradually developing problems (such as with equipment and consumables), we created raster plots of the DQC and first-pass call rate (CR1) to monitor performance over time. Figure 3 shows the distribution of DQC and CR1 for each Axiom plate assayed in the experiment.

Reproducibility analysis

To detect changes in the assay over the course of the experiment, we analyzed discordance of duplicate samples run on different plates, as described above. The gap in plate number between the original and duplicate assay was varied to give a range of intervals to examine. Figure S4 shows the distribution of original vs. duplicate sample plate numbers. This distribution allowed for detection of changes over a course of months.

Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other, relative to the number of genotype pairs for which both genotypes are called. Discordance can be categorized as a single-allele difference (one sample call being a homozygote and the other a heterozygote) or as a two-allele difference (the two samples being called opposite homozygotes). The large majority of discordances involve single-allele differences.

Discordant genotype pairs necessarily contain at least one minor allele; hence, a genotype discordance rate for a SNP is limited by minor allele frequency (MAF). For rare SNPs, it is typically the minor allele genotypes that are important, and hence the discordance rate among the heterozygous and minor allele homozygous genotypes becomes of interest, as opposed to the majority of genotype pairs, which are major allele homozygotes and concordant. To address discordance in these cases, we define a minor allele pair (MAP) as a genotype pair, both genotypes called, in which at least one genotype includes a minor allele. Then a useful measure of discordance among such pairs is the genotype discordance rate per MAP:

\[ \text{discordance per MAP} = \frac{\text{discordant pairs}}{\text{minor allele pairs}} \]
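The two discordance measures defined above can be computed side by side for one pair of duplicate samples. The sketch below assumes a genotype encoding chosen here for illustration (the count of minor alleles, 0, 1, or 2, with None for a no-call); the function name and return layout are likewise hypothetical.

```python
def discordance_stats(geno_a, geno_b):
    """Compare two genotype vectors from a duplicate sample pair.

    Genotypes are coded as minor-allele counts (0, 1, 2) or None for a
    no-call. This encoding is an assumption made for the example.
    """
    both_called = single = double = minor_allele_pairs = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # only pairs with both genotypes called are counted
        both_called += 1
        if a > 0 or b > 0:
            minor_allele_pairs += 1  # at least one genotype carries a minor allele
        diff = abs(a - b)
        if diff == 1:
            single += 1  # homozygote vs. heterozygote
        elif diff == 2:
            double += 1  # opposite homozygotes
    discordant = single + double
    return {
        "discordance": discordant / both_called if both_called else None,
        "discordance_per_map": (
            discordant / minor_allele_pairs if minor_allele_pairs else None
        ),
        "single_allele": single,
        "two_allele": double,
    }
```

Because every discordant pair contains a minor allele, the per-MAP rate is never smaller than the overall rate, which is why it is the more informative measure for rare SNPs.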

Genotype-calling package design

At various points along the duplicate plate number axis, reproducibility analysis showed sudden changes in genotype concordance. By conducting a change-point analysis and correlating the changes with known experimental events, factors affecting the probe intensity cluster centers were discovered. These factors, along with known prior factors such as array type, formed the categorical dimensions we used to partition the samples for final genotype calling of the cohort. The factors were chosen to maximize within-package homogeneity and hence optimize the empirical genotype concordance across duplicate samples in different parts of the partition. The factors driving this partition included array type, hybridization time, reagent kit type, reagent lot, and initial (low) DNA concentration. Low-concentration samples included those with initial concentrations between 15 and 30 ng/µl; those ≥30 ng/µl were considered to be in the normal range.

On the basis of these factors, the samples were partitioned into "packages" consisting of samples from 2–26 plates to achieve homogeneity of conditions. Each package then underwent genotype calling separately.

Table 1 Number and type of SNPs assayed and passing QC by array

Array   SNPs assayed   SNPs tiled once   Autosomal   X-linked   Y-chrom   mtDNA   Passing QC   % of SNPs passing
EUR     674,518        0                 660,990     13,123     289       116     670,572      99.42
EAS     712,950        65,473            699,324     13,385     158       83      708,373      99.36
AFR     893,631        429,451           867,035     26,264     234       98      878,176      98.27
LAT     817,810        282,901           792,056     25,397     234       123     802,186      98.09
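The partition into packages by experimental factors can be pictured as grouping samples on a composite key. The sketch below is a simplified illustration only: the record field names are hypothetical, the concentration cutoff reuses the value of 30 from the text, and it does not attempt to reproduce the additional constraint that packages span 2–26 plates.

```python
from collections import defaultdict

# Concentrations below this value are treated as "low" (value from the text).
NORMAL_CONC_MIN = 30


def package_key(sample):
    """Build the tuple of categorical factors that defines a package.

    The dictionary field names are assumptions made for this example.
    """
    low_conc = sample["dna_conc"] < NORMAL_CONC_MIN
    return (
        sample["array"],
        sample["hyb_hours"],
        sample["reagent_kit"],
        sample["reagent_lot"],
        low_conc,
    )


def partition_into_packages(samples):
    """Group sample IDs so that each group shares all assay factors."""
    packages = defaultdict(list)
    for sample in samples:
        packages[package_key(sample)].append(sample["sample_id"])
    return dict(packages)
```

Each resulting group would then be genotyped on its own, so that cluster centers are estimated only from samples assayed under matching conditions.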

Genotype postprocessing

After the package-based genotypes were created, genotyping quality for each probe set was assessed. Genotypes for poorly performing probe sets in each package were filtered out of the final data set; this was called per-package filtering. Probe set performance was also assessed across all packages for a particular array type, and poorly performing probe sets across packages were filtered out of all packages for that array; this was termed per-array filtering. In filtering genotypes and probe sets within and across packages, we employed liberal call rate thresholds to retain the maximum amount of potentially useful data. Therefore, depending on the particular use, further genotype filtering may be appropriate.

The filtering pipeline included five steps, as follows:

1. Per-package filtering of probe sets with a call rate <90%.

2. Filtering of probe sets with large allele frequency variation across packages within an array. Specifically, if $p_i$ is the allele frequency for package $i$ and $p'$ is the average allele frequency across packages, then

\[ \text{variance ratio} = \frac{p'(1 - p')}{\mathrm{Var}(p_i)} \]

A large variance ratio indicates a stable allele frequency across packages. Probe sets with variance ratio <31 for a given array were filtered out.

3. Filtering of autosomal probe sets based on a large allele frequency difference (>0.15) between males and females.

4. Filtering of probe sets with poor overall performance. Specifically, the per-array genotyping rate for each probe set was calculated as the total number of genotypes across all packages for that array that are called vs. attempted, and those with overall call rate <0.60 were filtered out.

5. Filtering of probe sets with poor genotype concordance across duplicates. Specifically, for each probe set and array type, the number of discordant genotypes across all pairs of duplicate samples in which both samples were assayed on the same array type was calculated. Then, probe sets for which the genotype discordance count exceeded array-dependent thresholds (208 discordant of 851 possible for EUR, 23 discordant of 61 possible for EAS, 8 discordant of 12 possible for AFR, 26 discordant of 71 possible for LAT) were filtered out.

The thresholds described in each step above are to some degree arbitrary, but were chosen to be conservative in terms of number of SNPs removed. The intention was to remove only SNPs with a very high probability of inaccuracy; this meant that poorly performing SNPs are likely remaining, but can be filtered out at later stages by end users.
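As a rough illustration of the per-array checks in steps 2 through 4 of the pipeline above, the sketch below computes the variance ratio and applies the quoted thresholds to one probe set. It rests on several assumptions made here: Var(p_i) is taken as the population variance across packages (the text does not say which estimator was used), the dictionary field names are hypothetical, and the sex-difference check is meant for autosomal probe sets only. Steps 1 and 5 are omitted for brevity.

```python
from statistics import mean, pvariance

# Thresholds quoted in the text.
MIN_VARIANCE_RATIO = 31.0
MAX_SEX_FREQ_DIFF = 0.15
MIN_ARRAY_CALL_RATE = 0.60


def variance_ratio(package_freqs):
    """p'(1 - p') / Var(p_i), using the population variance (an assumption)."""
    p_bar = mean(package_freqs)
    var = pvariance(package_freqs)
    if var == 0:
        return float("inf")  # perfectly stable allele frequency
    return p_bar * (1 - p_bar) / var


def array_level_failures(probe):
    """List which of steps 2-4 a probe set fails (empty list means it passes).

    `probe` is a dict with hypothetical keys: package_freqs, male_freq,
    female_freq, called, attempted.
    """
    reasons = []
    if variance_ratio(probe["package_freqs"]) < MIN_VARIANCE_RATIO:
        reasons.append("allele_frequency_variation")
    if abs(probe["male_freq"] - probe["female_freq"]) > MAX_SEX_FREQ_DIFF:
        reasons.append("sex_difference")
    if probe["called"] / probe["attempted"] < MIN_ARRAY_CALL_RATE:
        reasons.append("low_call_rate")
    return reasons
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it possible to tabulate how many probe sets each step removes.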

The threshold of 90% for per-package filtering was chosen through inspection of a statistical sample of SNP intensity plots at various SNP call rates. Those SNPs with a call rate <90% invariably had problems with cluster splits, highly overlapping clusters, or pathological cluster structure (such as a large number of off-target variant calls) that rendered the genotypes uncertain. Above 90%, SNPs with usable genotypes could be found, with the percentage of usable SNPs increasing with increasing call rate.

Figure 1 Standard Affymetrix Axiom Genotyping analysis workflow

Figure 2 First-pass sample call rate (CR1) vs. DQC. Black points are samples assayed with Axiom 1.0 and red points are those with Axiom 2.0. Threshold for a sample to pass is call rate ≥97%.


The threshold allele frequency variance ratio of 31 was also chosen empirically. Figure S5 shows a scatterplot of the natural log variance ratio vs. the minor allele frequency for SNPs on the EUR array. One observes an approximate partition of SNPs into regions above and below the threshold value of 31. Inspection of intensity plots for SNPs below the threshold value of 31 showed them to invariably have low signal intensity in both the A and B channels, leading to a single cluster near the A = B axis. The APT algorithm would sometimes call this as an AA cluster, sometimes as BB. The resulting wide allele frequency fluctuations produced a low variance ratio. Above 31, there were cases of SNPs with a large fraction of good genotypes, with some packages showing cluster splits.

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥0.15 is extremely small (<10⁻¹⁰). Thus, even though we are examining on the order of 10⁷–10⁸ SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
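The order of magnitude quoted above can be checked with a normal approximation to the difference of two sample allele frequencies. The calculation below is a sketch under assumptions introduced here rather than the authors' derivation: two alleles per individual at an autosomal SNP, the cohort's 58:42 female-to-male split applied to a package of 1000, and a true allele frequency of 0.5, which maximizes the sampling variance and therefore gives an upper bound.

```python
import math


def sex_diff_tail_probability(n_individuals=1000, female_fraction=0.58,
                              allele_freq=0.5, threshold=0.15):
    """Two-sided normal-approximation probability that the male and female
    sample allele frequencies differ by at least `threshold` when the true
    frequency is the same in both sexes."""
    n_female_alleles = 2 * n_individuals * female_fraction
    n_male_alleles = 2 * n_individuals * (1 - female_fraction)
    # Variance of the difference between two independent sample frequencies.
    variance = allele_freq * (1 - allele_freq) * (
        1 / n_female_alleles + 1 / n_male_alleles
    )
    z = threshold / math.sqrt(variance)
    # P(|Z| >= z) for a standard normal variable.
    return math.erfc(z / math.sqrt(2))
```

Under these assumptions the bound falls below 10⁻¹⁰, so multiplying it by 10⁸ SNP-package combinations still gives an expected count well below one, in line with the argument in the text.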

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two-sigma confidence bounds on the estimate.

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results: one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best-performing probe set in terms of call rate was chosen to represent the SNP and the other was filtered out.

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean (SD) DNA concentration ranged from a low of 115.16 (SD 75.09) in the month of March to a high of 147.04 (SD 87.66) in the month of July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, d.f. = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in the month of December (99.34%, SD 0.50) and the highest in the month of June (99.50%, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, d.f. = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, d.f. = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, d.f. = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of both DNA concentration and CR1 with storage time appeared to be nonlinear, however, in that the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, d.f. = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but less significant correlation between subject age and CR1 (r = 0.033, F = 112.36, d.f. = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. While Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, there is a stronger positive correlation between DQC and CR1 (by linear regression, adjusted R² = 0.468, P < 0.0001) with Axiom 2.0 than with Axiom 1.0 (by linear regression, adjusted R² = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 162.66, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely, lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75, mean for Axiom 1.0 = 99.43, t-test = 273.74, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the Gene Titans. This behavior also demonstrates why the real-time QC was critical to maintain a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 1062, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above), both for the original plate-based genotype calling (CR1) and package-based genotype calling, for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21816, P < 0.0001).

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF below 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotype calling using custom priors vs. plate-based genotype calling using standard Affymetrix priors for a subset of LAT-2.0 plates.

It is also seen in Figure 6 that for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance, and thus improve reproducibility of genotypes, for SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann–Whitney U-tests (for MAF above 0.10, plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for MAF between 0.05 and 0.10, plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for MAF between 0.01 and 0.05, plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF below 0.01, plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 93.84% overall, with slightly higher rates for individuals run on the AFR array (95.45%) and EAS array (95.08%) compared to those run on the EUR (93.85%) and LAT (92.08%) arrays (Table 2). In terms of SNP results (Table 1), the arrays with a substantial proportion of single-tiled SNPs had slightly lower overall SNP success rates (98.27% for AFR and 98.09% for LAT) than those with most or all SNPs tiled twice (99.36% for EAS and 99.42% for EUR).

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs for each of the four arrays is provided in Figure 7. All arrays show an approximately uniform frequency for MAF above 0.10, but a clear excess of SNPs with MAF below 0.10. The AFR and LAT arrays have the highest proportion of low-MAF SNPs and the EUR array the lowest. The reason for this difference lies in the design of the arrays (Hoffmann et al. 2011a,b). Because individuals with mixed genetic ancestry were analyzed on the LAT, AFR, and EAS arrays, coverage of low-frequency variants in more than one ancestral group was attempted for these arrays, as opposed to the EUR array, which was based solely on European ancestry.

Discussion

Using standard QC criteria in this saliva-based DNA genotyping experiment, we achieved a high overall success rate both for SNPs and samples (103,067 individuals successfully genotyped of 109,837 assayed). Still, that left 6770 "failed" samples that would not be usable for GWAS or other analyses. When a sample fails the sample call rate criterion of CR1 ≥97%, it does not mean that all genotypes in the sample are inherently unreliable. Typically, some probe sets cluster poorly and others cluster well, even for substandard samples. One strategy would be to remove the poorest-performing probe sets to increase the number of passing samples. However, this approach creates a tradeoff between eliminating poor SNPs and recovering previously failing samples: more samples are accepted, but at the expense of retaining fewer SNPs.

While the DQC contrast measure is generally a good predictor of CR1 and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, at the same time, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to effect this cure, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster split probe sets from those that have no cluster splits.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Differences can include single-allele differences (more common) or two-allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP, grouped by MAF, for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes and the dashed curves show discordance for plate-generated genotypes.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is provided by recalling the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hulls for each called cluster and to test if no-calls are contained in any of the hulls; if they are, then they are assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so that the converted no-calls would always be calls in the absence of artifact processing.
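The convex hull approach to rescuing no-calls can be sketched in a few lines of plain Python. This is an illustration under assumptions made here, not the authors' implementation: each sample is represented as a 2D point in whatever intensity coordinates the clustering uses, the hull is built with Andrew's monotone chain algorithm, points on the hull boundary count as inside, and a no-call is converted only if it falls inside exactly one hull. All function names are hypothetical.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # a degenerate hull encloses no area
    return all(_cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))


def rescue_no_calls(called, no_calls):
    """Assign each no-call the genotype of the single hull containing it.

    called: dict mapping genotype label -> list of (x, y) tuples.
    no_calls: list of (x, y) tuples. Returns a label or None per no-call.
    """
    hulls = {g: convex_hull(pts) for g, pts in called.items()}
    rescued = []
    for q in no_calls:
        hits = [g for g, hull in hulls.items() if inside_hull(hull, q)]
        rescued.append(hits[0] if len(hits) == 1 else None)
    return rescued
```

Leaving a point as a no-call when it falls inside more than one hull is a cautious design choice added for this sketch; the text does not describe how overlapping hulls were handled.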

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of "montages" during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the "plate QC" tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for "sample QC" purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health. The title of the dbGaP Parent study is The Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH), GO Project IRB: CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollment, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Table 2 Number of samples assayed and passing QC by array

Array   Assayed   Passing QC   % of all passed samples   % of samples passing
EUR     84,430    79,236       76.88                     93.85
EAS     8,043     7,647        7.42                      95.08
AFR     5,779     5,516        5.35                      95.45
LAT     11,585    10,668       10.35                     92.08
Total   109,837   103,067      100.0                     93.84

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Note added in proof: See Lapham et al. 2015 (pp. 1061–1072) and Banda et al. 2015 (pp. 1285–1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray


GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1


Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178905


Figure S1 Distribution of GERA saliva sample collection dates by month and year. [Bar chart; x-axis: month and year (Jul-08 to Jan-11); y-axis: number of samples (0 to 16,000).]

Figure S2 Distribution of saliva sample storage times prior to processing, in months. [Bar chart; x-axis: months (1 to 31); y-axis: number of samples (0 to 35,000).]

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.

Figure S6 Mean DNA concentration by month of the year. [Bar chart; x-axis: month (Jan to Dec); y-axis: mean DNA concentration (0 to 160).]

Figure S7 Mean initial call rate (CR1) by month of the year. [Chart; x-axis: month (Jan to Dec); y-axis: mean call rate (99.25 to 99.55).]

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. [Chart; x-axis: days of storage (<250, 250–350, 350–450, 450–550, 550–650, 650–750, 750–850); y-axis: mean DNA concentration (115 to 145).]

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. [Chart; x-axis: days of storage (<250, 250–350, 350–450, 450–550, 550–650, 650–750, 750–850); y-axis: mean call rate (98.9 to 99.6).]

Figure S10 Mean DNA concentration by age of study subject. [Chart; axes: age and mean DNA concentration (ticks 0 to 180).]

Figure S11 Mean initial call rate (CR1) by age of study subject. [Chart; x-axis: age; y-axis: mean call rate (99.3 to 99.55).]

Figure S12 Mean initial call rate (CR1) by DNA concentration. [Chart; x-axis: DNA concentration; y-axis: mean call rate (99.39 to 99.46).]

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.

Figure S14 The same probe set as in Figure S13, but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


an initial concentration range of 30–470 ng/µl were hit-picked and then normalized via a Span-8 Biomek FXP (Beckman Coulter) in 10 mM Tris, 0.1 mM EDTA, pH 8.0, to a concentration of 10 ng/µl for genotyping. To accommodate the timing of the design of the ancestry-specific microarrays used for genotyping (Hoffmann et al. 2011a,b), samples from self-reported non-Hispanic white participants were extracted first, normalized, and sent to the UCSF Genomics Core Facility for genotyping prior to the extraction and plating of DNA samples from self-reported minority participants. To maximize the inclusion of ethnic minority participants in the project, we reduced the lower end of the eligible initial concentration range to 15 from 30 ng/µl. Overall, 10.4% of extracted samples were excluded because the concentration fell outside the specified range. A subset of DNA samples was also checked for purity using a NanoDrop Technologies ND-1000 spectrophotometer. The A260/280 ratios of these samples were in the range 1.5–1.9, with most samples between 1.7 and 1.8. The average yield of DNA from each 0.5 ml of saliva used was 48.6 µg. A total of 110,266 samples of DNA were normalized into 96-well plates. The plates of normalized DNA linked a unique sample identification number to the well position and plate number of the DNA sample. The plates of DNA were transferred to the UCSF Genomics Core Facility along with a computer file that provided the link between the sample identification number and the sample well position and plate number. The laboratory results were linked to the same identification number; genotyping calls were also linked to the same identifier, and the file was returned to Kaiser Permanente for linkage to survey data and electronic medical records using a key that linked the original identifiers with the samples.

Axiom platform design

The genotyping experiment used custom-designed arrays based on the Affymetrix Axiom Genotyping Solution (Affymetrix White Paper). It is a two-color, ligation-based assay utilizing, on average, 30-mer oligonucleotide probes synthesized in situ on a microarray substrate, with automated parallel processing of 96 samples per plate and a total of 1.38 million features available for experimental content. The design of the four ethnic-specific arrays has been described previously (Hoffmann et al. 2011a,b). The original design of the Axiom arrays required probe sets for each SNP to be included at least twice for QC reasons. However, after initial experience with our first array for non-Hispanic whites (Hoffmann et al. 2011a), we expanded the number of SNPs on subsequent arrays by including some SNP probe sets only once, based on high-performance characteristics (Hoffmann et al. 2011b). The four arrays were designed to maximize coverage in non-Hispanic whites (EUR), East Asians (EAS), African Americans (AFR), and Latinos (LAT). The number of autosomal, X-linked, Y-linked, and mitochondrial SNPs on each of the four arrays is provided in Table 1. As detailed below, the presence of a higher proportion of probe sets tiled once on the AFR and LAT arrays affected their QC characteristics.

Assay protocol

At UCSF, a small set of successfully run plates was designated as a source of duplicate samples for later plates. On each plate, one of the wells was filled with a previously run sample (i.e., as a duplicate) to evaluate assay reproducibility as a QC measure going forward. The remaining 95 wells were occupied by new samples. These duplicate samples were the only ones run more than one time in the study. Most duplicate samples were run on the same ethnic array as the original sample. However, because all duplicates came from samples that had already been run successfully, this was not possible for the first few plates of each array. Hence, for the first few EAS, AFR, and LAT plates, the duplicate sample came from a sample previously run on the EUR array. For the first few EUR plates, the duplicate samples were from prior Affymetrix reference samples.

The samples were then processed using the standard Affymetrix Axiom sample prep protocol (Affymetrix Genotyping Protocol). A major change in protocol that occurred during the experiment, after all EUR and EAS plates were processed, was the implementation of a novel reagent kit from Affymetrix. The original reagent kit, designated Axiom 1.0, was primarily used for assaying the EUR and EAS arrays, while the updated reagent kit, referred to as Axiom 2.0, was used for assaying AFR, LAT, and later EUR and EAS arrays. The differences between Axiom 1.0 and Axiom 2.0 are at the level of specific reagents in module 1, such as cosolvents used in the isothermal whole genome amplification. The shift to Axiom 2.0 enabled the validation of more SNPs overall and increased genome-wide coverage, with a modest loss of previously validated SNPs on Axiom arrays. The reagent kit affected intensity cluster centers in the genotype clustering process and hence needed to be controlled for during genotyping, as described below.

Samples were generally randomized across Axiom plates within array type, but some nonrandom structure was necessary from pragmatic considerations. First, subjects on each array type were genotyped together, independently of the other array types. Second, the first plates assayed with the EUR arrays contained many of the oldest subjects (ages 85 and older), whose samples had been collected first to maximize their participation in the research. The first samples were then prioritized for DNA extraction and genotyping, resulting in a higher percentage of samples from elderly subjects in the first group of plates of non-Hispanic whites to be genotyped. Most plates contained a random distribution of females vs. males (according to the overall 58:42 ratio of females to males in the cohort). However, because the GERA cohort also included a subset of male subjects from the California Men's Health Study (Enger et al. 2006), who were processed during a distinct time period, some plates contained a majority of male subjects.


Once samples were prepared and embedded into Axiom assay plates, they were assayed on Axiom Gene Titans. The standard Affymetrix protocol was generally followed, with the exception of the hybridization time. The experiment began with the standard 24-hr hybridization interval, but it was soon found to be advantageous to hybridize for longer periods of 48–72 hr to maximize the signal-to-noise ratio. Both 48- and 72-hr hybridization times gave similar performance, and both were used to facilitate lab scheduling.

Genotype calling protocol

To rapidly identify system problems affecting genotype quality, a short-term automated quality control feedback cycle was created for the earliest possible detection and repair of major problems in the assay. The feedback cycle was based on a data-driven automatic analysis pipeline using Affymetrix Power Tools (APT). As soon as a Gene Titan completed its scanning phase, CEL files were transferred to a Linux cluster for analysis. There, all successfully assayed samples from a single plate were genotyped together and the results sent via e-mail to the project staff.

The genotype calling was based on the standard three-step process as described in the Affymetrix analysis guide (Figure 1) (Affymetrix Axiom Analysis Guides 2015). For a sample to pass quality control, it required a DishQC (DQC) score ≥0.82 and a first-pass sample call rate (CR1) ≥97%. DishQC is a measure of the contrast between the AT and GC signals assayed in nonpolymorphic test sequences (Affymetrix White Paper). It provides a type of signal-to-noise figure of merit that is well correlated with sample call rate, allowing for prediction of successful samples.
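The two sample-level thresholds can be illustrated with a short sketch. This is not the APT pipeline itself; the `SampleQC` record, its field names, and the example values are hypothetical, and only the two thresholds (DQC ≥0.82, CR1 ≥97%) come from the text.

```python
from dataclasses import dataclass

DQC_MIN = 0.82   # DishQC threshold quoted in the text
CR1_MIN = 97.0   # first-pass call rate threshold, in percent


@dataclass
class SampleQC:
    sample_id: str   # hypothetical identifier
    dqc: float       # DishQC score
    cr1: float       # first-pass sample call rate, in percent


def passes_qc(sample: SampleQC) -> bool:
    """Return True only when both thresholds are met."""
    return sample.dqc >= DQC_MIN and sample.cr1 >= CR1_MIN


# Made-up example values, for illustration only.
samples = [
    SampleQC("s1", 0.95, 99.4),
    SampleQC("s2", 0.80, 98.1),  # fails on DQC
    SampleQC("s3", 0.90, 96.5),  # fails on CR1
]
passing = [s.sample_id for s in samples if passes_qc(s)]
```

Keeping the two thresholds as named constants makes it easy to explore the effect of relaxing one of them on the number of passing samples.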

It was found advantageous to create custom Bayesian priors for genotyping intensity cluster centers and covariances rather than using the standard Affymetrix-supplied priors. To create the custom priors, samples from 8–12 high-performing plates distributed across time and assay conditions were genotyped together using weak generic priors. The resulting posterior distribution statistical parameters became the new custom priors. Custom priors proved superior for two reasons: saliva-based DNA tended to produce different cluster centers than the Affymetrix blood-based priors, and increasing the hybridization time also tended to shift cluster centers.

In addition to the genotyping results, montages of array digital images were used to visually identify errors in processing and sources of unexpected noise (Figure S3). These montages proved especially valuable in diagnosing problems in a prompt fashion.

While short-term monitoring of the genotype assay allowed discovery and correction of acute problems, it did not address changes over time. Therefore, to track gradually developing problems (such as with equipment and consumables), we created raster plots of the DQC and first-pass call rate (CR1) to monitor performance over time. Figure 3 shows the distribution of DQC and CR1 for each Axiom plate assayed in the experiment.

Reproducibility analysis

To detect changes in the assay over the course of the experiment, we analyzed discordance of duplicate samples run on different plates, as described above. The gap in plate number between the original and duplicate assay was varied to give a range of intervals to examine. Figure S4 shows the distribution of original vs. duplicate sample plate numbers. This distribution allowed for detection of changes over a course of months.

Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other, relative to the number of genotype pairs for which both genotypes are called. Discordance can be categorized as a single-allele difference (one sample call being a homozygote and the other a heterozygote) or as a two-allele difference (the two samples being called opposite homozygotes). The large majority of discordances involve single-allele differences.

Discordant genotype pairs necessarily contain at least one minor allele; hence a genotype discordance rate for a SNP is limited by minor allele frequency (MAF). For rare SNPs, it is typically the minor allele genotypes that are important, and hence the discordance rate among the heterozygous and minor allele homozygous genotypes becomes of interest, as opposed to the majority of genotype pairs, which are major allele homozygotes and concordant. To address discordance in these cases, we define a minor allele pair (MAP) as a genotype pair, both genotypes called, in which at least one genotype includes a minor allele. Then a useful measure of discordance among such pairs is the genotype discordance rate per MAP:

$$\text{discordance per MAP} = \frac{\text{discordant pairs}}{\text{minor allele pairs}}$$
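The definitions above can be made concrete with a minimal sketch. The encoding is an assumption made for illustration: each genotype is the count of minor alleles (0, 1, or 2), with `None` for a no-call, and the function name `discordance_stats` is hypothetical. The function works on any collection of genotype pairs, whether all SNPs for one duplicate pair or all duplicate pairs for one SNP.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # minor-allele count 0, 1, or 2; None = no call


def discordance_stats(a: Sequence[Genotype], b: Sequence[Genotype]) -> dict:
    """Summarize discordance between two aligned genotype sequences."""
    # Only pairs in which both genotypes are called are considered.
    called = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    discordant = [(x, y) for x, y in called if x != y]
    # Single-allele difference: homozygote vs. heterozygote.
    one_allele = sum(1 for x, y in discordant if abs(x - y) == 1)
    # Two-allele difference: opposite homozygotes.
    two_allele = sum(1 for x, y in discordant if abs(x - y) == 2)
    # Minor allele pair (MAP): at least one genotype carries a minor allele.
    maps = sum(1 for x, y in called if x > 0 or y > 0)
    nan = float("nan")
    return {
        "discordance": len(discordant) / len(called) if called else nan,
        "one_allele": one_allele,
        "two_allele": two_allele,
        "discordance_per_map": len(discordant) / maps if maps else nan,
    }


# Made-up genotypes for an original sample and its duplicate.
original = [0, 0, 1, 2, 0, None, 1, 0]
duplicate = [0, 1, 1, 2, 0, 0, 2, 0]
stats = discordance_stats(original, duplicate)
```

In this toy example, 2 of the 7 jointly called pairs are discordant, while 2 of the 4 minor allele pairs are discordant, which shows how the per-MAP rate removes the dilution caused by concordant major-allele homozygotes.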

Genotype-calling package design

At various points along the duplicate plate number axis, reproducibility analysis showed sudden changes in genotype concordance. By conducting a change-point analysis and correlating the changes with known experimental events, factors affecting the probe intensity cluster centers were discovered. These factors, along with known prior factors such as array type, formed the categorical dimensions we used to partition the samples for final genotype calling of the cohort. The factors were chosen to maximize within-package homogeneity and hence optimize the empirical genotype concordance across duplicate samples in different parts of the partition. The factors driving this partition included array type, hybridization time, reagent kit type, reagent lot, and initial (low) DNA concentration. Low-concentration samples included those with initial concentrations between 15 and 30 ng/µl; those >30 ng/µl were considered to be in the normal range.

Table 1 Number and type of SNPs assayed and passing QC by array

Array   SNPs assayed   SNPs tiled once   Autosomal   X-linked   Y-chrom   mtDNA   Passing QC   % of SNPs passing
EUR          674,518                 0     660,990     13,123       289     116      670,572               99.42
EAS          712,950            65,473     699,324     13,385       158      83      708,373               99.36
AFR          893,631           429,451     867,035     26,264       234      98      878,176               98.27
LAT          817,810           282,901     792,056     25,397       234     123      802,186               98.09

On the basis of these factors, the samples were partitioned into "packages" consisting of samples from 2–26 plates to achieve homogeneity of conditions. Each package then underwent genotype calling separately.
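As a rough illustration of this kind of partition, the sketch below groups plates by a tuple of categorical factors. The record layout, field names, and example values are hypothetical, and the actual package boundaries in the study also depended on change-point analysis, which is not modeled here.

```python
from collections import defaultdict


def package_key(plate: dict) -> tuple:
    """Tuple of categorical factors that defines a package (assumed fields)."""
    return (
        plate["array"],        # EUR, EAS, AFR, or LAT
        plate["hyb_hours"],    # hybridization time
        plate["reagent_kit"],  # Axiom 1.0 or 2.0
        plate["reagent_lot"],
        plate["low_conc"],     # True for low initial DNA concentration
    )


# Made-up plate records, for illustration only.
plates = [
    {"plate": 1, "array": "EUR", "hyb_hours": 24, "reagent_kit": "1.0",
     "reagent_lot": "A", "low_conc": False},
    {"plate": 2, "array": "EUR", "hyb_hours": 48, "reagent_kit": "1.0",
     "reagent_lot": "A", "low_conc": False},
    {"plate": 3, "array": "EUR", "hyb_hours": 48, "reagent_kit": "1.0",
     "reagent_lot": "A", "low_conc": False},
    {"plate": 4, "array": "LAT", "hyb_hours": 48, "reagent_kit": "2.0",
     "reagent_lot": "B", "low_conc": False},
    {"plate": 5, "array": "LAT", "hyb_hours": 48, "reagent_kit": "2.0",
     "reagent_lot": "B", "low_conc": True},
]

packages = defaultdict(list)
for plate in plates:
    packages[package_key(plate)].append(plate["plate"])
```

Each resulting group would then be genotyped on its own, so that cluster centers are estimated only from samples assayed under similar conditions.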

Genotype postprocessing

After the package-based genotypes were created, genotyping quality for each probe set was assessed. Genotypes for poorly performing probe sets in each package were filtered out of the final data set; this was called per-package filtering. Probe set performance was also assessed across all packages for a particular array type, and poorly performing probe sets across packages were filtered out of all packages for that array; this was termed per-array filtering. In filtering genotypes and probe sets within and across packages, we employed liberal call rate thresholds to retain the maximum amount of potentially useful data. Therefore, depending on the particular use, further genotype filtering may be appropriate.

The filtering pipeline included five steps, as follows:

1. Per-package filtering of probe sets with a call rate <90%.

2. Filtering of probe sets with large allele frequency variation across packages within an array. Specifically, if $p_i$ is the allele frequency for package $i$ and $p'$ is the average allele frequency across packages, then

$$\text{variance ratio} = \frac{p'(1 - p')}{\mathrm{Var}(p_i)}$$

A large variance ratio indicates a stable allele frequency across packages. Probe sets with variance ratio <31 for a given array were filtered out.

3. Filtering of autosomal probe sets based on a large allele frequency difference (>0.15) between males and females.

4. Filtering of probe sets with poor overall performance. Specifically, the per-array genotyping rate for each probe set was calculated as the total number of genotypes across all packages for that array that are called vs. attempted, and those with overall call rate <0.60 were filtered out.

5. Filtering of probe sets with poor genotype concordance across duplicates. Specifically, for each probe set and array type, the number of discordant genotypes across all pairs of duplicate samples in which both samples were assayed on the same array type was calculated. Then probe sets for which the genotype discordance count exceeded array-dependent thresholds (208 discordant of 851 possible for EUR; 23 discordant of 61 possible for EAS; 8 discordant of 12 possible for AFR; 26 discordant of 71 possible for LAT) were filtered out.
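Steps 2 through 4 of the list above can be sketched for a single autosomal probe set as follows. This is an illustrative sketch, not the authors' pipeline: the function names and example frequencies are hypothetical, and the use of the sample variance for Var(p_i) is an assumption, since the text does not say which variance estimator was used.

```python
import math
from statistics import mean, variance

MIN_VARIANCE_RATIO = 31.0    # step 2 threshold
MAX_SEX_DIFF = 0.15          # step 3 threshold
MIN_ARRAY_CALL_RATE = 0.60   # step 4 threshold


def variance_ratio(package_freqs):
    """p'(1 - p') / Var(p_i) across the packages of one array."""
    p_bar = mean(package_freqs)
    var = variance(package_freqs)  # sample variance: an assumption
    if var == 0:
        return math.inf  # identical frequency in every package
    return p_bar * (1 - p_bar) / var


def keep_probe_set(package_freqs, male_freq, female_freq, array_call_rate):
    """Apply steps 2-4 to one autosomal probe set; True means retained."""
    if variance_ratio(package_freqs) < MIN_VARIANCE_RATIO:
        return False
    if abs(male_freq - female_freq) > MAX_SEX_DIFF:
        return False
    if array_call_rate < MIN_ARRAY_CALL_RATE:
        return False
    return True


# Made-up per-package allele frequencies.
stable = [0.30, 0.31, 0.29, 0.30]     # consistent across packages
unstable = [0.02, 0.95, 0.05, 0.90]   # cluster called AA in some, BB in others
```

A probe set whose single cluster is labeled AA in some packages and BB in others produces frequencies that swing between near 0 and near 1, which drives the variance ratio toward 1 and well below the threshold.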

The thresholds described in each step above are to some degree arbitrary but were chosen to be conservative in terms of number of SNPs removed. The intention was to remove only SNPs with a very high probability of inaccuracy; this means that some poorly performing SNPs likely remain, but these can be filtered out at later stages by end users.

The threshold of 90% for per-package filtering was chosen through inspection of a statistical sample of SNP intensity plots at various SNP call rates. Those SNPs with a call rate <90% invariably had problems with cluster splits, highly overlapping clusters, or pathological cluster structure (such as a large number of off-target variant calls) that rendered the genotypes uncertain. Above 90%, SNPs with usable genotypes could be found, with the percentage of usable SNPs increasing with increasing call rate.

Figure 1 Standard Affymetrix Axiom Genotyping analysis workflow.

Figure 2 First-pass sample call rate (CR1) vs. DQC. Black points are samples assayed with Axiom 1.0 and red points are those with Axiom 2.0. Threshold for a sample to pass is call rate ≥97%.


The threshold allele frequency variance ratio of 31 was also chosen empirically. Figure S5 shows a scatterplot of the natural log variance ratio vs. the minor allele frequency for SNPs on the EUR array. One observes an approximate partition of SNPs to regions above and below the threshold value of 31. Inspection of intensity plots for SNPs below the threshold showed them to invariably have low signal intensity in both the A and B channels, leading to a single cluster near the A = B axis. The APT algorithm would sometimes call this as an AA cluster, sometimes as BB. The resulting wide allele frequency fluctuations produced a low variance ratio. Above 31, there were cases of SNPs with a large fraction of good genotypes, with some packages showing cluster splits.

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥0.15 is extremely small ($<10^{-10}$). Thus, even though we are examining on the order of $10^7$–$10^8$ SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
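The order of magnitude of this probability can be checked with a short calculation. The sketch below is one way to do it, under assumptions not stated in the text: an even split of 500 males and 500 females, an autosomal SNP (two alleles per person), a true allele frequency of 0.5 (which maximizes the sampling variance), and a normal approximation to the difference in sample allele frequencies. The function name is hypothetical.

```python
import math


def sex_diff_tail_prob(n_males, n_females, p, threshold):
    """Two-sided normal-approximation probability that the male and female
    sample allele frequencies differ by at least `threshold` when the true
    frequency is `p` in both sexes (autosomal SNP, 2 alleles per person)."""
    se = math.sqrt(p * (1 - p) * (1 / (2 * n_males) + 1 / (2 * n_females)))
    z = threshold / se
    return math.erfc(z / math.sqrt(2))


# Assumed even split of a 1000-person package, worst-case p = 0.5.
prob = sex_diff_tail_prob(500, 500, 0.5, 0.15)

# Expected number of chance exceedances over 1e8 SNP-package combinations.
expected_false_hits = prob * 1e8
```

Under these assumptions the standard error of the difference is about 0.022, so a difference of 0.15 lies more than six standard errors from zero, and the expected number of chance exceedances stays well below one even across $10^8$ combinations.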

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two-sigma confidence bounds on the estimate.

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results: one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best performing probe set in terms of call rate was chosen to represent the SNP, and the other was filtered out.
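The selection of one representative probe set per SNP can be sketched as below. The record layout, the probe set and SNP identifiers, and the call rates are hypothetical; the text does not say how ties were broken, so this sketch simply keeps the first probe set encountered.

```python
def pick_best_probe_sets(probe_sets):
    """Keep, for each SNP, the surviving probe set with the highest call rate.

    `probe_sets` is an iterable of dicts with hypothetical keys
    'probe_set', 'snp', and 'call_rate'. Ties keep the first one seen.
    """
    best = {}
    for ps in probe_sets:
        current = best.get(ps["snp"])
        if current is None or ps["call_rate"] > current["call_rate"]:
            best[ps["snp"]] = ps
    return best


# Made-up probe sets: snpA has forward and reverse designs, snpB only one.
surviving = [
    {"probe_set": "ps-1-fwd", "snp": "snpA", "call_rate": 0.991},
    {"probe_set": "ps-1-rev", "snp": "snpA", "call_rate": 0.997},
    {"probe_set": "ps-2-fwd", "snp": "snpB", "call_rate": 0.985},
]
chosen = {snp: ps["probe_set"] for snp, ps in pick_best_probe_sets(surviving).items()}
```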

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean (SD) DNA concentration ranged from a low of 115.16 (SD 75.09) in the month of March to a high of 147.04 (SD 87.66) in the month of July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, df = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in the month of December (99.34%, SD 0.50) and the highest in the month of June (99.50%, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, df = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, df = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, df = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of both DNA concentration and CR1 with storage time appeared to be nonlinear, however, in that the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, df = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but less significant correlation between subject age and CR1 (r = 0.033, F = 112.36, df = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. While Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, there is a stronger positive correlation between DQC and CR1 with Axiom 2.0 (by linear regression, adjusted R² = 0.468, P < 0.0001) than with Axiom 1.0 (by linear regression, adjusted R² = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.
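The total passing rates quoted above are consistent with multiplying the two stage-wise rates, as the short check below illustrates. It assumes that the CR1 passing rate is computed among samples that already passed DQC, which matches the order of the standard workflow but is not stated explicitly here; the function name is hypothetical.

```python
def overall_pass_rate(dqc_pass_rate, cr1_pass_rate):
    """Overall rate when CR1 is evaluated only on DQC-passing samples
    (an assumption about how the two rates were tabulated)."""
    return dqc_pass_rate * cr1_pass_rate


# EUR arrays, rates as fractions (95.1%/98.6% and 98.0%/97.8%).
axiom_1 = overall_pass_rate(0.951, 0.986)
axiom_2 = overall_pass_rate(0.980, 0.978)
```

Rounded to one decimal place as percentages, the two products agree with the reported totals of 93.8% and 95.8%.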

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 162.66, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B): namely, lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75, mean for Axiom 1.0 = 99.43, t-test = 273.74, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the Gene Titans. This behavior also demonstrates why the real-time QC was critical to maintain a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 10.62, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above), both for the original plate-based genotype calling (CR1) and package-based genotype calling, for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21.816, P < 0.0001).

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF < 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median, and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotype calling using custom priors vs. plate-based genotype calling using standard Affymetrix priors for a subset of LAT-2.0 plates.


It is also seen in Figure 6 that, for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance and thus improve reproducibility of genotypes of SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann–Whitney U-tests (for MAF > 0.10, plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for 0.05 < MAF < 0.10, plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for 0.01 < MAF < 0.05, plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF < 0.01, plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 93.84% overall, with slightly higher rates for individuals run on the AFR array (95.45%) and EAS array (95.08%) compared to those run on the EUR (93.85%) and LAT (92.08%) arrays (Table 2). In terms of SNP results (Table 1), the arrays with a substantial proportion of single-tiled SNPs had slightly lower overall SNP success rates (98.27% for AFR and 98.09% for LAT) than those with most or all SNPs tiled twice (99.36% for EAS and 99.42% for EUR).

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs for each of the four arrays is provided in Figure 7. All arrays show an approximately uniform frequency for MAF > 0.10 but a clear excess of SNPs with MAF < 0.10. The AFR and LAT arrays have the highest proportion of low-MAF SNPs and the EUR array the lowest. The reason for this difference lies in the design of the arrays (Hoffmann et al. 2011a,b). Because individuals with mixed genetic ancestry were analyzed on the LAT, AFR, and EAS arrays, coverage of low-frequency variants in more than one ancestral group was attempted for these arrays, as opposed to the EUR array, which was based solely on European ancestry.

Discussion

Using standard QC criteria in this saliva-based DNA genotyping experiment, we achieved a high overall success rate for both SNPs and samples (103,067 individuals successfully genotyped of 109,837 assayed). Still, that left 6770 “failed” samples that would not be usable for GWAS or other analyses. When a sample fails the sample call rate criterion of CR1 ≥ 97%, it does not mean that all genotypes in the sample are inherently unreliable. Typically, some probe sets cluster poorly and others cluster well, even for substandard samples. One strategy would be to remove the poorest-performing probe sets to increase the number of passing samples. However, this approach creates a tradeoff between eliminating poor SNPs and recovering previously failing samples: more samples are accepted, but at the expense of retaining fewer SNPs.

While the DQC contrast measure is generally a good predictor of CR1, and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to apply this remedy, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster-split probe sets from those that have no cluster splits.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Differences can include single-allele differences (more common) or two-allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP, grouped by MAF, for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes and the dashed curves show discordance for plate-generated genotypes.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is to recall the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hull for each called cluster and to test whether each no-call is contained in any of the hulls; if it is, it is assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so the converted no-calls would always have been calls in the absence of artifact processing.
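The convex hull test described above can be sketched in a few lines. This is a minimal illustration only, not the procedure used in the study: the two-dimensional coordinates, the genotype labels, the function names, and the rule that a no-call falling inside more than one hull stays a no-call are all assumptions made for the example.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, q):
    """True if q lies inside or on the boundary of a counter-clockwise convex hull."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))

def rescue_no_calls(called, no_calls):
    """Assign a genotype to each no-call that falls inside exactly one cluster hull.

    called:   {genotype: [(x, y), ...]} intensity coordinates of called samples
    no_calls: {sample_id: (x, y)} intensity coordinates of no-called samples
    """
    hulls = {g: convex_hull(pts) for g, pts in called.items()}
    rescued = {}
    for sample_id, point in no_calls.items():
        hits = [g for g, hull in hulls.items() if inside_hull(hull, point)]
        if len(hits) == 1:  # assumption: ambiguous points are left as no-calls
            rescued[sample_id] = hits[0]
    return rescued

# Hypothetical clusters in an arbitrary 2D intensity space.
called = {
    "AA": [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)],
    "AB": [(5, 0), (7, 0), (7, 2), (5, 2)],
}
no_calls = {"n1": (1.0, 0.5), "n2": (6.0, 1.0), "n3": (3.5, 1.0)}
rescued = rescue_no_calls(called, no_calls)
```

In this toy input, `n1` and `n2` fall inside the AA and AB hulls and are converted, while `n3` lies between the clusters and remains a no-call.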

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of “montages” during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the “plate QC” tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for “sample QC” purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Table 2 Number of samples assayed and passing QC by array

Array   Assayed   Passing QC   % of all passed samples   % of samples passing
EUR     84,430    79,236       76.88                     93.85
EAS     8,043     7,647        7.42                      95.08
AFR     5,779     5,516        5.35                      95.45
LAT     11,585    10,668       10.35                     92.08
Total   109,837   103,067      100.0                     93.84

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health. The title of the dbGaP Parent study is The Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH), GO Project IRB CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollment, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Lapham et al. 2015 (pp. 1061–1072) and Banda et al. 2015 (pp. 1285–1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men’s Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray


GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale, Stephanie Hesselson, Thomas J. Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A. Croen, Brad P. Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K. Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P. Quesenberry Jr., Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A. Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch

Copyright © 2015 by the Genetics Society of America. DOI: 10.1534/genetics.115.178905

Figure S1 Distribution of GERA saliva sample collection dates by month and year. [x-axis: Month and Year (Jul-08 to Jan-11); y-axis: Number]

Figure S2 Distribution of saliva sample storage times prior to processing, in months. [x-axis: Months; y-axis: Number]

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.

Figure S6 Mean DNA concentration by month of the year. [x-axis: Month (Jan to Dec); y-axis: Mean DNA Concentration]

Figure S7 Mean initial call rate (CR1) by month of the year. [x-axis: Month (Jan to Dec); y-axis: Mean Call Rate]

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. [x-axis: Days of Storage (<250 to 750–850); y-axis: Mean DNA Concentration]

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. [x-axis: Days of Storage (<250 to 750–850); y-axis: Mean Call Rate]

Figure S10 Mean DNA concentration by age of study subject. [x-axis: Age; y-axis: Mean DNA Concentration]

Figure S11 Mean initial call rate (CR1) by age of study subject. [x-axis: Age; y-axis: Mean Call Rate]

Figure S12 Mean initial call rate (CR1) by DNA concentration. [x-axis: DNA Concentration; y-axis: Mean Call Rate]

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.

Figure S14 The same probe set as in Figure S13, but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


Once samples were prepared and embedded into Axiom assay plates, they were assayed on Axiom GeneTitans. The standard Affymetrix protocol was generally followed, with the exception of the hybridization time. The experiment began with the standard 24-hr hybridization interval, but it was soon found to be advantageous to hybridize for longer periods of 48–72 hr to maximize the signal-to-noise ratio. Both 48- and 72-hr hybridization times gave similar performance, and both were used to facilitate lab scheduling.

Genotype calling protocol

To rapidly identify system problems affecting genotype quality, a short-term, automated quality control feedback cycle was created for the earliest possible detection and repair of major problems in the assay. The feedback cycle was based on a data-driven automatic analysis pipeline using Affymetrix Power Tools (APT). As soon as a GeneTitan completed its scanning phase, CEL files were transferred to a Linux cluster for analysis. There, all successfully assayed samples from a single plate were genotyped together and the results sent via e-mail to the project staff.

The genotype calling was based on the standard three-step process described in the Affymetrix analysis guide (Figure 1) (Affymetrix Axiom Analysis Guides 2015). For a sample to pass quality control, it required a DishQC (DQC) score ≥0.82 and a first-pass sample call rate (CR1) ≥97%. DishQC is a measure of the contrast between the AT and GC signals assayed in nonpolymorphic test sequences (Affymetrix White Paper 2009). It provides a type of signal-to-noise figure of merit that is well correlated with sample call rate, allowing for prediction of successful samples.
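The two sample-level criteria can be expressed as a simple predicate. The sketch below is illustrative only: the function and variable names and the example samples are hypothetical, and it assumes DQC and CR1 have already been computed by the genotyping software.

```python
DQC_MIN = 0.82  # DishQC threshold given in the text
CR1_MIN = 97.0  # first-pass call rate threshold given in the text, in percent

def sample_passes_qc(dqc, cr1_percent, dqc_min=DQC_MIN, cr1_min=CR1_MIN):
    """Return True when a sample meets both the DQC and the CR1 threshold."""
    return dqc >= dqc_min and cr1_percent >= cr1_min

# Hypothetical samples: (sample id, DQC, CR1 in percent)
samples = [("s1", 0.95, 99.4), ("s2", 0.80, 99.1), ("s3", 0.90, 96.2)]
passing = [sid for sid, dqc, cr1 in samples if sample_passes_qc(dqc, cr1)]
```

Keeping the thresholds as parameters makes it easy to explore what a relaxed DQC cutoff would recover, for example `sample_passes_qc(0.80, 99.1, dqc_min=0.75)`.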

It was found advantageous to create custom Bayesian priors for genotyping intensity cluster centers and covariances, rather than using the standard Affymetrix-supplied priors. To create the custom priors, samples from 8–12 high-performing plates distributed across time and assay conditions were genotyped together using weak generic priors. The resulting posterior distribution statistical parameters became the new custom priors. Custom priors proved superior for two reasons: saliva-based DNA tended to produce different cluster centers than the Affymetrix blood-based priors, and increasing the hybridization time also tended to shift cluster centers.

In addition to the genotyping results, montages of array digital images were used to visually identify errors in processing and sources of unexpected noise (Figure S3). These montages proved especially valuable in diagnosing problems promptly.

While short-term monitoring of the genotype assay allowed discovery and correction of acute problems, it did not address changes over time. Therefore, to track gradually developing problems (such as with equipment and consumables), we created raster plots of the DQC and first-pass call rate (CR1) to monitor performance over time. Figure 2 shows the distribution of DQC and CR1 for each Axiom plate assayed in the experiment.

Reproducibility analysis

To detect changes in the assay over the course of the experiment, we analyzed discordance of duplicate samples run on different plates, as described above. The gap in plate number between the original and duplicate assay was varied to give a range of intervals to examine. Figure S4 shows the distribution of original vs. duplicate sample plate numbers. This distribution allowed for detection of changes over a course of months.

Genotype discordance is defined for a pair of duplicate samples as the number of genotype pairs called on both samples that differ from each other, divided by the number of genotype pairs for which both genotypes are called. Discordance can be categorized as a single-allele difference (one sample call being a homozygote and the other a heterozygote) or as a two-allele difference (the two samples being called opposite homozygotes). The large majority of discordances involve single-allele differences.

Discordant genotype pairs necessarily contain at least one minor allele; hence, a genotype discordance rate for a SNP is limited by minor allele frequency (MAF). For rare SNPs, it is typically the minor allele genotypes that are important, and hence the discordance rate among the heterozygous and minor allele homozygous genotypes becomes of interest, as opposed to the majority of genotype pairs, which are major allele homozygotes and concordant. To address discordance in these cases, we define a minor allele pair (MAP) as a genotype pair, both genotypes called, in which at least one genotype includes a minor allele. Then a useful measure of discordance among such pairs is the genotype discordance rate per MAP:

$$\text{discordance per MAP} = \frac{\text{discordant pairs}}{\text{minor allele pairs}}$$
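A minimal sketch of both discordance measures for one duplicate pair follows. The genotype encoding (count of minor alleles, with `None` for a no-call) and the function name are assumptions made for the example, not part of the published pipeline.

```python
def discordance_stats(calls_a, calls_b):
    """Compare two genotype vectors from duplicate assays of the same sample.

    Genotypes are coded as the number of minor alleles (0, 1, or 2);
    None marks a no-call. Only pairs called in both assays are counted.
    """
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a is not None and b is not None]
    discordant = sum(1 for a, b in both if a != b)
    # A minor allele pair (MAP) has at least one genotype carrying a minor allele.
    maps = sum(1 for a, b in both if a > 0 or b > 0)
    return {
        "pairs_called": len(both),
        "discordance": discordant / len(both) if both else float("nan"),
        "discordance_per_map": discordant / maps if maps else float("nan"),
        "single_allele": sum(1 for a, b in both if abs(a - b) == 1),
        "two_allele": sum(1 for a, b in both if abs(a - b) == 2),
    }

# Hypothetical duplicate pair over ten SNPs.
original = [0, 0, 0, 0, 1, 1, 2, None, 0, 0]
duplicate = [0, 0, 0, 0, 1, 0, 2, 1, 0, 0]
stats = discordance_stats(original, duplicate)
```

Here nine pairs are called in both assays, one is discordant, and three are minor allele pairs, so overall discordance is 1/9 while discordance per MAP is 1/3. The example shows why the per-MAP rate is the more informative measure when most genotype pairs are concordant major allele homozygotes.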

Table 1 Number and type of SNPs assayed and passing QC by array

Array   SNPs assayed   SNPs tiled once   Autosomal   X-linked   Y-chrom   mtDNA   Passing QC   % of SNPs passing
EUR     674,518        0                 660,990     13,123     289       116     670,572      99.42
EAS     712,950        65,473            699,324     13,385     158       83      708,373      99.36
AFR     893,631        429,451           867,035     26,264     234       98      878,176      98.27
LAT     817,810        282,901           792,056     25,397     234       123     802,186      98.09

Genotype-calling package design

At various points along the duplicate plate number axis, reproducibility analysis showed sudden changes in genotype concordance. By conducting a change-point analysis and correlating the changes with known experimental events, factors affecting the probe intensity cluster centers were discovered. These factors, along with known prior factors such as array type, formed the categorical dimensions we used to partition the samples for final genotype calling of the cohort. The factors were chosen to maximize within-package homogeneity and hence optimize the empirical genotype concordance across duplicate samples in different parts of the partition. The factors driving this partition included array type, hybridization time, reagent kit type, reagent lot, and initial (low) DNA concentration. Low-concentration samples included those with initial concentrations between 15 and 30 ng/ml; those >30 ng/ml were considered to be in the normal range.

On the basis of these factors, the samples were partitioned into “packages” consisting of samples from 2–26 plates to achieve homogeneity of conditions. Each package then underwent genotype calling separately.
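Partitioning by categorical factors amounts to grouping samples on a composite key. The sketch below illustrates the idea only: the record field names, the example values, and the treatment of exactly 30 ng/ml as low concentration are assumptions, and the study's actual packages were also shaped by change-point analysis and plate counts, which this does not model.

```python
from collections import defaultdict

LOW_DNA_MAX = 30.0  # ng/ml; treating the boundary value as "low" is an assumption

def package_key(sample):
    """Composite key built from the factors listed in the text."""
    return (
        sample["array"],
        sample["hyb_hours"],
        sample["kit"],
        sample["reagent_lot"],
        sample["dna_conc"] <= LOW_DNA_MAX,  # True for low-concentration samples
    )

def partition_into_packages(samples):
    """Group sample ids by their composite factor key."""
    packages = defaultdict(list)
    for sample in samples:
        packages[package_key(sample)].append(sample["id"])
    return dict(packages)

# Hypothetical sample records.
samples = [
    {"id": "s1", "array": "EUR", "hyb_hours": 48, "kit": "1.0", "reagent_lot": "A", "dna_conc": 50.0},
    {"id": "s2", "array": "EUR", "hyb_hours": 48, "kit": "1.0", "reagent_lot": "A", "dna_conc": 80.0},
    {"id": "s3", "array": "EUR", "hyb_hours": 48, "kit": "1.0", "reagent_lot": "A", "dna_conc": 20.0},
    {"id": "s4", "array": "LAT", "hyb_hours": 72, "kit": "2.0", "reagent_lot": "B", "dna_conc": 60.0},
]
packages = partition_into_packages(samples)
```

With these records, `s1` and `s2` share a package, while `s3` is separated by its low DNA concentration and `s4` by its array type and assay conditions.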

Genotype postprocessing

After the package-based genotypes were created, genotyping quality for each probe set was assessed. Genotypes for poorly performing probe sets in each package were filtered out of the final data set; this was called per-package filtering. Probe set performance was also assessed across all packages for a particular array type, and poorly performing probe sets across packages were filtered out of all packages for that array; this was termed per-array filtering. In filtering genotypes and probe sets within and across packages, we employed liberal call rate thresholds to retain the maximum amount of potentially useful data. Therefore, depending on the particular use, further genotype filtering may be appropriate.

The filtering pipeline included five steps, as follows:

1. Per-package filtering of probe sets with a call rate <90%.

2. Filtering of probe sets with large allele frequency variation across packages within an array. Specifically, if $p_i$ is the allele frequency for package $i$ and $\bar{p}$ is the average allele frequency across packages, then

$$\text{variance ratio} = \frac{\bar{p}(1 - \bar{p})}{\mathrm{Var}(p_i)}$$

A large variance ratio indicates a stable allele frequency across packages. Probe sets with variance ratio <31 for a given array were filtered out.

3. Filtering of autosomal probe sets based on a large allele frequency difference (>0.15) between males and females.

4. Filtering of probe sets with poor overall performance. Specifically, the per-array genotyping rate for each probe set was calculated as the total number of genotypes across all packages for that array that are called vs. attempted, and those with overall call rate <0.60 were filtered out.

5. Filtering of probe sets with poor genotype concordance across duplicates. Specifically, for each probe set and array type, the number of discordant genotypes across all pairs of duplicate samples in which both samples were assayed on the same array type was calculated. Then, probe sets for which the genotype discordance count exceeded array-dependent thresholds (>208 discordant of 851 possible for EUR, >23 discordant of 61 possible for EAS, >8 discordant of 12 possible for AFR, >26 discordant of 71 possible for LAT) were filtered out.
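The variance ratio in step 2 can be computed directly from per-package allele frequencies. The sketch below is an illustration under stated assumptions: the text does not say whether the mean is weighted by package size or which variance estimator was used, so an unweighted mean and the sample variance (n − 1 denominator) are assumed, and the function names are hypothetical.

```python
from statistics import mean, variance

VR_THRESHOLD = 31  # probe sets with a variance ratio below this value are filtered

def variance_ratio(package_freqs):
    """Binomial variance at the mean frequency divided by the between-package variance."""
    p_bar = mean(package_freqs)    # assumption: unweighted mean across packages
    var_p = variance(package_freqs)  # assumption: sample variance (n - 1)
    if var_p == 0:
        return float("inf")        # identical frequencies in every package
    return p_bar * (1 - p_bar) / var_p

def fails_variance_ratio_filter(package_freqs, threshold=VR_THRESHOLD):
    """True when allele frequency fluctuates too much across packages."""
    return variance_ratio(package_freqs) < threshold

# Hypothetical probe sets: allele frequency in each of four packages.
stable = [0.30, 0.31, 0.29, 0.30]
unstable = [0.05, 0.95, 0.10, 0.90]  # e.g. one cluster called AA in some packages, BB in others
```

The stable probe set yields a ratio in the thousands and is retained, whereas the unstable one yields a ratio close to 1 and is filtered out.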

The thresholds described in each step above are to some degree arbitrary but were chosen to be conservative in terms of the number of SNPs removed. The intention was to remove only SNPs with a very high probability of inaccuracy; this means that some poorly performing SNPs likely remain, but they can be filtered out at later stages by end users.

The threshold of 90% for per-package filtering was chosen through inspection of a statistical sample of SNP intensity plots at various SNP call rates. Those SNPs with a call rate <90% invariably had problems with cluster splits, highly overlapping clusters, or pathological cluster structure (such as a large number of off-target variant calls) that rendered the genotypes uncertain. Above 90%, SNPs with usable genotypes could be found, with the percentage of usable SNPs increasing with increasing call rate.

Figure 1 Standard Affymetrix Axiom Genotyping analysis workflow.

Figure 2 First-pass sample call rate (CR1) vs. DQC. Black points are samples assayed with Axiom 1.0 and red points are those assayed with Axiom 2.0. Threshold for a sample to pass is call rate ≥97%.

The threshold allele frequency variance ratio of 31 was also chosen empirically. Figure S5 shows a scatterplot of the natural log variance ratio vs. the minor allele frequency for SNPs on the EUR array. One observes an approximate partition of SNPs into regions above and below the threshold value of 31. Inspection of intensity plots for SNPs below the threshold showed them to invariably have low signal intensity in both the A and B channels, leading to a single cluster near the A = B axis. The APT algorithm would sometimes call this as an AA cluster and sometimes as BB. The resulting wide allele frequency fluctuations produced a low variance ratio. Above 31, there were cases of SNPs with a large fraction of good genotypes, with some packages showing cluster splits.

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥0.15 is extremely small ($<10^{-10}$). Thus, even though we are examining on the order of $10^7$–$10^8$ SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
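The order of magnitude of this probability can be checked with a normal approximation. The sketch below is not the authors' calculation: it assumes an autosomal SNP, an even split of 500 males and 500 females within the package, the worst-case allele frequency of 0.5, and a two-sided test, none of which are specified in the text.

```python
import math

def prob_freq_difference(p, n_male, n_female, threshold):
    """Two-sided normal-approximation probability that male and female allele
    frequency estimates differ by at least `threshold` under equal true frequency p.
    Each individual contributes two alleles (autosomal SNP assumed)."""
    var_diff = p * (1 - p) * (1 / (2 * n_male) + 1 / (2 * n_female))
    z = threshold / math.sqrt(var_diff)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# Worst case p = 0.5 maximizes the sampling variance of the difference.
prob = prob_freq_difference(p=0.5, n_male=500, n_female=500, threshold=0.15)

# Expected number of chance exceedances across 10^8 SNP-package combinations.
expected_by_chance = prob * 1e8
```

Under these assumptions the probability is on the order of $10^{-11}$, so the expected number of chance exceedances stays well below 1 even across $10^8$ SNP-package combinations.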

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of about 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two-sigma confidence bounds on the estimate.

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results: one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best-performing probe set in terms of call rate was chosen to represent the SNP, and the other was filtered out.
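Selecting one representative probe set per SNP is a small reduction over the surviving probe sets. In the sketch below, the record fields, identifiers, and tie-breaking rule (the first probe set encountered wins a tie) are hypothetical choices for illustration.

```python
def choose_best_probe_sets(probe_sets):
    """Keep, for each SNP, the surviving probe set with the highest call rate."""
    best = {}
    for ps in probe_sets:
        current = best.get(ps["snp"])
        # Strict '>' means the first probe set seen wins a tie (an assumption).
        if current is None or ps["call_rate"] > current["call_rate"]:
            best[ps["snp"]] = ps
    return {snp: ps["probe_set"] for snp, ps in best.items()}

# Hypothetical probe sets that survived filtering.
surviving = [
    {"snp": "snp_1", "probe_set": "ps_1_fwd", "call_rate": 0.991},
    {"snp": "snp_1", "probe_set": "ps_1_rev", "call_rate": 0.997},
    {"snp": "snp_2", "probe_set": "ps_2_fwd", "call_rate": 0.985},
]
chosen = choose_best_probe_sets(surviving)
```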

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean DNA concentration ranged from a low of 115.16 (SD 75.09) in March to a high of 147.04 (SD 87.66) in July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, df = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in December (99.34%, SD 0.50) and the highest in June (99.50%, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, df = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, df = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, df = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of storage time with both DNA concentration and CR1 appeared to be nonlinear, however: the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, df = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but weaker correlation between subject age and CR1 (r = 0.033, F = 112.36, df = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, and the positive correlation between DQC and CR1 is stronger with Axiom 2.0 (by linear regression, adjusted R² = 0.468, P < 0.0001) than with Axiom 1.0 (by linear regression, adjusted R² = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 16266, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75, mean for Axiom 1.0 = 99.43, t-test = 27374, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the GeneTitans. This behavior also demonstrates why the real-time QC was critical to maintaining a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 10.62, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above) both for the original plate-based genotype calling (CR1) and for package-based genotype calling for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21.816, P < 0.0001).

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF < 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotypecalling using custom priors vs plate-based genotype calling using stan-dard Affymetrix priors for a subset of LAT-20 plates


It is also seen in Figure 6 that, for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance and thus improve reproducibility of genotypes of SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann-Whitney U-tests (for MAF > 0.10: plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for 0.05 < MAF < 0.10: plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for 0.01 < MAF < 0.05: plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF < 0.01: plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 93.84% overall, with slightly higher rates for individuals run on the AFR array (95.45%) and EAS array (95.08%) compared to those run on the EUR (93.85%) and LAT (92.08%) arrays (Table 2). In terms of SNP results (Table 1), the arrays with a substantial proportion of single-tiled SNPs had slightly lower overall SNP success rates (98.27% for AFR and 98.09% for LAT) than those with most or all SNPs tiled twice (99.36% for EAS and 99.42% for EUR).

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs for each of the four arrays is provided in Figure 7. All arrays show an approximately uniform frequency for MAF > 0.10 but a clear excess of SNPs with MAF < 0.10. The AFR and LAT arrays have the highest proportion of low-MAF SNPs and the EUR array the lowest. The reason for this difference lies in the design of the arrays (Hoffmann et al. 2011a,b). Because individuals with mixed genetic ancestry were analyzed on the LAT, AFR, and EAS arrays, coverage of low-frequency variants in more than one ancestral group was attempted for these arrays, as opposed to the EUR array, which was based solely on European ancestry.

Discussion

Using standard QC criteria in this saliva-based DNA genotyping experiment, we achieved a high overall success rate both for SNPs and samples (103,067 individuals successfully genotyped of 109,837 assayed). Still, that left 6,770 "failed" samples that would not be usable for GWAS or other analyses. When a sample fails the sample call rate criterion of CR1 ≥ 97%, it does not mean that all genotypes in the sample are inherently unreliable. Typically, some probe sets cluster poorly and others cluster well, even for substandard samples. One strategy would be to discard the poorest performing probe sets to increase the number of passing samples. However, this approach creates a tradeoff between eliminating poor SNPs and recovering previously failing samples: more samples are accepted, but at the expense of accepting fewer SNPs.

While the DQC contrast measure is generally a good predictor of CR1, and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, at the same time, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to effect this cure, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster-split probe sets from those that have no cluster splits.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Differences can include single-allele differences (more common) or two-allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP, grouped by MAF, for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes, and the dashed curves show discordance for plate-generated genotypes.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is provided by recalling the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hulls for each called cluster and to test if no-calls are contained in any of the hulls; if they are, then they are assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so that the converted no-calls would always be calls in the absence of artifact processing.
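The convex hull alternative described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure as implemented by the authors or by APT: the two-dimensional intensity coordinates, the function names, and the rule that a no-call falling inside more than one hull stays a no-call are all assumptions. The hull is built with the standard monotone chain algorithm.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # hypothetical 2-D intensity coordinates


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Monotone chain convex hull; vertices are returned counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # a degenerate hull encloses no area
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def rescue_no_calls(
    called: Dict[str, List[Point]], no_calls: List[Point]
) -> List[Optional[str]]:
    """Assign each no-call the genotype of the single hull that contains it."""
    hulls = {genotype: convex_hull(pts) for genotype, pts in called.items()}
    rescued: List[Optional[str]] = []
    for p in no_calls:
        hits = [g for g, hull in hulls.items() if inside_hull(hull, p)]
        # Assumption: ambiguous points (zero or several hulls) stay no-calls.
        rescued.append(hits[0] if len(hits) == 1 else None)
    return rescued


clusters = {
    "AA": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "BB": [(10.0, 10.0), (12.0, 10.0), (12.0, 12.0), (10.0, 12.0)],
}
calls = rescue_no_calls(clusters, [(1.0, 1.0), (11.0, 11.0), (5.0, 5.0)])
```

Here the first two no-calls fall inside the AA and BB hulls and are converted, while the third lies between clusters and remains a no-call.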

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of "montages" during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the "plate QC" tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for "sample QC" purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Table 2 Number of samples assayed and passing QC by array

Array    Assayed    Passing QC    % of all passed samples    % of samples passing
EUR       84,430        79,236                      76.88                   93.85
EAS        8,043         7,647                       7.42                   95.08
AFR        5,779         5,516                       5.35                   95.45
LAT       11,585        10,668                      10.35                   92.08
Total    109,837       103,067                     100.00                   93.84

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health. The title of the dbGaP Parent study is The Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH), GO Project IRB: CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollment, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Lapham et al. 2015 (pp. 1061-1072) and Banda et al. 2015 (pp. 1285-1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1-27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79-89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422-430.

Communicating editor: N. R. Wray


GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1


Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178905

Figure S1 Distribution of GERA saliva sample collection dates by month and year. [Bar chart; x-axis: Month and Year (Jul-08 to Jan-11); y-axis: Number (0 to 16,000).]

Figure S2 Distribution of saliva sample storage times prior to processing, in months. [Bar chart; x-axis: Months (1 to 31); y-axis: Number (0 to 35,000).]

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.

Figure S6 Mean DNA concentration by month of the year. [Chart; x-axis: Month (Jan to Dec); y-axis: Mean DNA Concentration (0 to 160).]

Figure S7 Mean initial call rate (CR1) by month of the year. [Chart; x-axis: Month (Jan to Dec); y-axis: Mean Call Rate (99.25 to 99.55).]

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. [Chart; x-axis: Days of Storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); y-axis: Mean DNA Concentration (115 to 145).]

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. [Chart; x-axis: Days of Storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); y-axis: Mean Call Rate (98.9 to 99.6).]

Figure S10 Mean DNA concentration by age of study subject. [Chart; x-axis: Age; y-axis: Mean DNA Concentration (0 to 180).]

Figure S11 Mean initial call rate (CR1) by age of study subject. [Chart; x-axis: Age; y-axis: Mean Call Rate (99.3 to 99.55).]

Figure S12 Mean initial call rate (CR1) by DNA concentration. [Chart; x-axis: DNA Concentration; y-axis: Mean Call Rate (99.39 to 99.46).]

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


discovered. These factors, along with known prior factors such as array type, formed the categorical dimensions we used to partition the samples for final genotype calling of the cohort. The factors were chosen to maximize within-package homogeneity and hence optimize the empirical genotype concordance across duplicate samples in different parts of the partition. The factors driving this partition included array type, hybridization time, reagent kit type, reagent lot, and initial (low) DNA concentration. Low-concentration samples included those with initial concentrations between 15 and 30 ng/µl; those >30 ng/µl were considered to be in the normal range.

On the basis of these factors, the samples were partitioned into "packages" consisting of samples from 2-26 plates to achieve homogeneity of conditions. Each package then underwent genotype calling separately.

Genotype postprocessing

After the package-based genotypes were created, genotyping quality for each probe set was assessed. Genotypes for poorly performing probe sets in each package were filtered out of the final data set; this was called per-package filtering. Probe set performance was also assessed across all packages for a particular array type, and poorly performing probe sets across packages were filtered out of all packages for that array; this was termed per-array filtering. In filtering genotypes and probe sets within and across packages, we employed liberal call rate thresholds to retain the maximum amount of potentially useful data. Therefore, depending on the particular use, further genotype filtering may be appropriate.

The filtering pipeline included five steps, as follows:

1. Per-package filtering of probe sets with a call rate <90%.

2. Filtering of probe sets with large allele frequency variation across packages within an array. Specifically, if $p_i$ is the allele frequency for package $i$ and $\bar{p}$ is the average allele frequency across packages, then

$$\text{variance ratio} = \frac{\bar{p}\,(1 - \bar{p})}{\mathrm{Var}(p_i)}.$$

A large variance ratio indicates a stable allele frequency across packages. Probe sets with variance ratio <31 for a given array were filtered out.

3. Filtering of autosomal probe sets based on a large allele frequency difference (>0.15) between males and females.

4. Filtering of probe sets with poor overall performance. Specifically, the per-array genotyping rate for each probe set was calculated as the total number of genotypes across all packages for that array that are called vs. attempted, and those with overall call rate <0.60 were filtered out.

5. Filtering of probe sets with poor genotype concordance across duplicates. Specifically, for each probe set and array type, the number of discordant genotypes across all pairs of duplicate samples in which both samples were assayed on the same array type was calculated. Then probe sets for which the genotype discordance count exceeded array-dependent thresholds (208 discordant of 851 possible for EUR, 23 discordant of 61 possible for EAS, 8 discordant of 12 possible for AFR, 26 discordant of 71 possible for LAT) were filtered out.

The thresholds described in each step above are to some degree arbitrary but were chosen to be conservative in terms of the number of SNPs removed. The intention was to remove only SNPs with a very high probability of inaccuracy; this means that poorly performing SNPs likely remain but can be filtered out at later stages by end users.

The threshold of 90% for per-package filtering was chosen through inspection of a statistical sample of SNP intensity plots at various SNP call rates. Those SNPs with a call rate <90% invariably had problems with cluster splits, highly overlapping clusters, or pathological cluster structure (such as a large number of off-target variant calls) that rendered the genotypes uncertain. Above 90%, SNPs with usable genotypes could be found, with the percentage of usable SNPs increasing with increasing call rate.

Figure 1 Standard Affymetrix Axiom Genotyping analysis workflow.

Figure 2 First-pass sample call rate (CR1) vs. DQC. Black points are samples assayed with Axiom 1.0, and red points are those with Axiom 2.0. Threshold for a sample to pass is call rate ≥97%.

The threshold allele frequency variance ratio of 31 was also chosen empirically. Figure S5 shows a scatterplot of the natural log variance ratio vs. the minor allele frequency for SNPs on the EUR array. One observes an approximate partition of SNPs into regions above and below the 31 threshold value. Inspection of intensity plots for SNPs below the 31 threshold value showed them to invariably have low signal intensity in both the A and B channels, leading to a single cluster near the A = B axis. The APT algorithm would sometimes call this as an AA cluster, sometimes as BB. The resulting wide allele frequency fluctuations produced a low variance ratio. Above 31, there were cases of SNPs with a large fraction of good genotypes, with some packages showing cluster splits.

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥0.15 is extremely small (<10^-10). Thus, even though we are examining on the order of 10^7-10^8 SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
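The order of magnitude quoted above can be checked with a short calculation. The sketch below is not the authors' derivation; it assumes a normal approximation to the difference of two binomial allele frequencies, an even split of 500 males and 500 females in the package, two alleles per individual (autosomal SNP), and the worst-case allele frequency of 0.5, which maximizes the sampling variance.

```python
import math


def prob_freq_diff_at_least(
    delta: float, n_males: int, n_females: int, p: float = 0.5
) -> float:
    """Two-sided normal-approximation probability that |p_male - p_female| >= delta."""
    # Each individual contributes two alleles at an autosomal SNP.
    se = math.sqrt(p * (1.0 - p) * (1.0 / (2 * n_males) + 1.0 / (2 * n_females)))
    z = delta / se
    return math.erfc(z / math.sqrt(2.0))


# Assumed package composition: 1000 individuals, half male and half female.
prob = prob_freq_diff_at_least(0.15, n_males=500, n_females=500)

# Expected number of chance exceedances over 10^8 SNP-package combinations.
expected_by_chance = prob * 1e8
```

Under these assumptions the probability is on the order of 10^-11, so even across 10^8 SNP-package combinations the expected number of chance exceedances is well below one, in line with the statement in the text.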

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two-sigma confidence bounds on the estimate.
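The reasoning above can be illustrated numerically. The sketch below does not reproduce how the array-dependent thresholds were derived, which the text does not spell out; it assumes a binomial normal approximation for the two-sigma bounds, uses the first-order relation (discordance is about twice the per-sample error rate), and simply evaluates the bounds at the threshold counts listed in step 5 of the filtering pipeline. All function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def discordance_from_error_rate(error_rate: float) -> float:
    """First-order approximation: either duplicate can carry the error."""
    return 2.0 * error_rate


def two_sigma_bounds(discordant: int, pairs: int) -> Tuple[float, float]:
    """Estimated discordance rate minus/plus two binomial standard errors."""
    p_hat = discordant / pairs
    sigma = math.sqrt(p_hat * (1.0 - p_hat) / pairs)
    return p_hat - 2.0 * sigma, p_hat + 2.0 * sigma


# Threshold counts (discordant, possible) as listed in the filtering steps.
THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "EUR": (208, 851),
    "EAS": (23, 61),
    "AFR": (8, 12),
    "LAT": (26, 71),
}

target = discordance_from_error_rate(0.10)  # 20% discordance
lower_bounds = {
    array: two_sigma_bounds(d, n)[0] for array, (d, n) in THRESHOLDS.items()
}
```

With these inputs, the lower two-sigma bound at each listed threshold lies above the 20% target, and the margin widens as the number of duplicate pairs shrinks, which is consistent with the thresholds being conservative for the arrays with few duplicates.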

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results, one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best performing probe set in terms of call rate was chosen to represent the SNP, and the other was filtered out.

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean (SD) DNA concentration ranged from a low of 115.16 (SD 75.09) in the month of March to a high of 147.04 (SD 87.66) in the month of July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, df = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in the month of December (99.34%, SD 0.50) and the highest in the month of June (99.50%, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, df = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, df = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, df = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of both DNA concentration and CR1 with storage time appeared to be nonlinear, however, in that the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, df = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but less significant correlation between subject age and CR1 (r = 0.033, F = 112.36, df = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).
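The link between the reported correlation coefficients and the variance explained is simply the square of r for a single-predictor linear regression. A brief check, using only the r values quoted above (the function name is hypothetical):

```python
def variance_explained_pct(r: float) -> float:
    """Percentage of variance explained by a single linear predictor (r squared)."""
    return 100.0 * r * r


age_vs_dna = variance_explained_pct(0.120)  # age vs. DNA concentration
age_vs_cr1 = variance_explained_pct(0.033)  # age vs. CR1
```

These evaluate to about 1.44% and 0.11%, matching the rounded figures of 1.4% and 0.1% in the text.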

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. While Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, there is a stronger positive correlation between DQC and CR1 with Axiom 2.0 (by linear regression, adjusted R² = 0.468, P < 0.0001) than with Axiom 1.0 (by linear regression, adjusted R² = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.
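The total passing rates quoted above are consistent with the two QC steps being applied in sequence. The arithmetic can be checked as below, assuming (this is not stated explicitly in the text) that the CR1 passing rate is computed among samples that already passed DQC, so that the overall rate is the product of the two.

```python
def overall_pass_rate(dqc_pass: float, cr1_pass: float) -> float:
    """Overall pass rate when CR1 is evaluated only on samples passing DQC."""
    return dqc_pass * cr1_pass


axiom_1 = overall_pass_rate(0.951, 0.986)  # EUR arrays, Axiom 1.0
axiom_2 = overall_pass_rate(0.980, 0.978)  # EUR arrays, Axiom 2.0
```

The products are about 0.938 and 0.958, which agree with the reported total passing rates of 93.8% and 95.8%.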

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 16.266, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely, lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75%, mean for Axiom 1.0 = 99.43%, t-test = 27.374, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the GeneTitan instruments. This behavior also demonstrates why the real-time QC was critical to maintain a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic, weak cluster-location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling inpackages by grouping plates assayed under similar conditions(items 1 and 3 above) we first examined plate medians forpackage-based genotype calling with custom priors vs plate-based genotype calling using standard Affymetrix priors (Fig-ure 4) A sizeable improvement in overall call rates can beobserved with an average increase of 1 in call rate that isstatistically significant (t-test = 1062 P 00001)

As a second evaluation we calculated duplicate genotypediscordance (item 4 above) both for the original plate-basedgenotype calling (CR1) and package-based genotype callingfor the 828 duplicate samples on the EUR array (Figure 5)The partition of the cohort into packages based on experi-mental factors and using custom priors significantly in-creased the overall concordance For example mediandiscordance decreased from 06 to 03 with package-based genotype calling a difference that is statistically sig-nificant (comparing plate- vs package-based discordance bypaired t-test t-test = 21816 P 00001)

To investigate the effect of MAF on SNP discordance induplicate samples we evaluated the genotype discordancerate per MAP as a function of MAF comparing plate-generated genotypes and package-generated genotypes(Figure 6) For both plate and package genotypes the dis-cordance per MAP increases as SNP MAF decreases with thegreatest discordance seen in SNPs with MAF 001 Thisreflects the difficulty of genotyping rare SNPs with dominantmajor homozygous clusters

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each(96 well) Axiom plate assayed in the experiment Black bars indicate platemedian and red error bars are 61 median absolute deviation Blue linesindicate boundaries between array types For example EUR-10 indicatesarray type EUR and Axiom 10 assay EUR-20 indicates array EUR withAxiom 20 assay

Figure 4 Plate medians of sample call rates for package-based genotypecalling using custom priors vs plate-based genotype calling using stan-dard Affymetrix priors for a subset of LAT-20 plates

Genome-Wide Genotyping of GERA Cohort 1057

It is also seen in Figure 6 that for each MAF class thediscordance per MAP is higher for the plate-generated geno-types than for the package-generated genotypes The gapbetween plate and package discordance increases with de-creasing MAF While package -based genotyping can de-crease discordance and thus improve reproducibility ofgenotypes of SNPs of all MAFs it improves reproducibilitymost dramatically for rare SNPs (item 2 above) All the dif-ferences in discordance for plate- vs package-generated geno-types are highly statistically significant by MannndashWhitneyU-tests (for MAF 010 plate mean = 00216 packagemean = 00070 P 00001 for 005 MAF 010 platemean = 00593 package mean = 00251 P 00001 for001 MAF 005 plate mean = 0113 package mean =00477 P 00001 for MAF 001 plate mean = 0445package mean = 0167 P 00001)

Overall sample and SNP success rates

The sample genotyping success rate was 9384 overallwith slightly higher rates for individuals run on the AFRarray (9545) and EAS array (9508) compared to thoserun on the EUR (9385) and LAT (9208) arrays (Table2) In terms of SNP results (Table 1) the arrays with a sub-stantial proportion of single-tiled SNPs had slightly loweroverall SNP success rates (9827 for AFR and 9809 forLAT) than those with most or all SNPs tiled twice (9936for EAS and 9942 for EUR)

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs foreach of the four arrays is provided in Figure 7 All arraysshow an approximate uniform frequency for MAF 010but a clear excess of SNPs with MAF 010 The AFR andLAT have the highest proportion of low MAF SNPs and the

EUR array the lowest The reason for this difference lies inthe design of the arrays (Hoffmann et al 2011ab) Becauseindividuals with mixed genetic ancestry were analyzed onthe LAT AFR and EAS arrays coverage of low-frequencyvariants in more than one ancestral group was attempted forthese arrays as opposed to the EUR array which was basedsolely on European ancestry

Discussion

Using standard QC criteria in this saliva-based DNA geno-typing experiment we achieved a high overall success rateboth for SNPs and samples (103067 individuals successfullygenotyped of 109837 assayed) Still that left 6770 ldquofailedrdquosamples that would not be usable for GWAS or other anal-yses When a sample fails the sample call rate criterion ofCR1 97 it does not mean that all genotypes in the sam-ple are inherently unreliable Typically some probe sets clus-ter poorly and others cluster well even for substandardsamples One strategy would be to kill the poorest perform-ing probe sets to increase the number of passing samplesHowever this approach creates a tradeoff between eliminat-ing poor SNPs and recovering previously failing samples Onthe one hand more samples are accepted but at the ex-pense of accepting fewer SNPs

While the DQC contrast measure is generally a good predictor of CR1, and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, at the same time, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to effect this cure, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster-split probe sets from those that have no cluster splits.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Differences can include single-allele differences (more common) or two-allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP, grouped by MAF, for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes and the dashed curves show discordance for plate-generated genotypes.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is to recall the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hull for each called cluster and to test whether no-calls are contained in any of the hulls; if they are, then they are assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so that the converted no-calls would always be calls in the absence of artifact processing.
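The convex hull test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: each sample is taken to be a point in a two-dimensional intensity space, the function names are hypothetical, and a no-call is rescued only when exactly one cluster hull contains it (how overlapping hulls were handled is not stated in the text).

```python
def cross(o, a, b):
    """2-D cross product of vectors OA and OB (positive if counter-clockwise)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex hull."""
    if len(hull) < 3:
        return False  # degenerate cluster: rescue nothing
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def rescue_no_calls(called, no_calls):
    """called: {genotype: [(x, y), ...]}; no_calls: [(x, y), ...].

    Returns one genotype (or None) per no-call, in input order.
    """
    hulls = {g: convex_hull(pts) for g, pts in called.items()}
    rescued = []
    for p in no_calls:
        hits = [g for g, hull in hulls.items() if inside_hull(hull, p)]
        # Assumption: assign only when exactly one hull contains the point.
        rescued.append(hits[0] if len(hits) == 1 else None)
    return rescued


# Hypothetical coordinates, for illustration only.
called = {
    "AA": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "BB": [(5, 5), (7, 5), (7, 7), (5, 7)],
}
print(rescue_no_calls(called, [(1, 1), (6, 6), (3.5, 3.5)]))  # ['AA', 'BB', None]
```

Because a hull never extends beyond the outermost called samples of its cluster, a point lying between clusters stays a no-call, which matches the conservative behavior described in the text.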

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of "montages" during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the "plate QC" tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for "sample QC" purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Table 2 Number of samples assayed and passing QC by array

Array    Assayed    Passing QC    % of all passed samples    % of samples passing
EUR       84,430        79,236                      76.88                   93.85
EAS        8,043         7,647                       7.42                   95.08
AFR        5,779         5,516                       5.35                   95.45
LAT       11,585        10,668                      10.35                   92.08
Total    109,837       103,067                     100.00                   93.84

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health. The title of the dbGaP Parent study is The Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH), GO Project IRB CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollment, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Note added in proof: See Lapham et al. 2015 (pp. 1061–1072) and Banda et al. 2015 (pp. 1285–1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray

GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale, Stephanie Hesselson, Thomas J. Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A. Croen, Brad P. Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K. Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P. Quesenberry Jr., Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A. Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178905

Figure S1 Distribution of GERA saliva sample collection dates by month and year. [Bar chart; x-axis: month and year (Jul-08 to Jan-11); y-axis: number (0 to 16,000).]

Figure S2 Distribution of saliva sample storage times prior to processing, in months. [Bar chart; x-axis: months (1 to 31); y-axis: number (0 to 35,000).]

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.

Figure S6 Mean DNA concentration by month of the year. [x-axis: month (Jan to Dec); y-axis: mean DNA concentration (0 to 160).]

Figure S7 Mean initial call rate (CR1) by month of the year. [x-axis: month (Jan to Dec); y-axis: mean call rate (99.25 to 99.55).]

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. [x-axis: days of storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); y-axis: mean DNA concentration (115 to 145).]

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. [x-axis: days of storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); y-axis: mean call rate (98.9 to 99.6).]

Figure S10 Mean DNA concentration by age of study subject. [Axes: age; mean DNA concentration (0 to 180).]

Figure S11 Mean initial call rate (CR1) by age of study subject. [Axes: age; mean call rate (99.3 to 99.55).]

Figure S12 Mean initial call rate (CR1) by DNA concentration. [Axes: DNA concentration; mean call rate (99.39 to 99.46).]

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


The threshold allele frequency variance ratio of 31 was also chosen empirically. Figure S5 shows a scatterplot of the natural log variance ratio vs. the minor allele frequency for SNPs on the EUR array. One observes an approximate partition of SNPs into regions above and below the threshold value of 31. Inspection of intensity plots for SNPs below the threshold showed them to invariably have low signal intensity in both the A and B channels, leading to a single cluster near the A = B axis. The APT algorithm would sometimes call this as an AA cluster, sometimes as BB. The resulting wide allele frequency fluctuations produced a low variance ratio. Above 31, there were cases of SNPs with a large fraction of good genotypes, with some packages showing cluster splits.

In terms of sex difference in allele frequency, if we assume equal allele frequency in males and females and a package size of 1000 individuals, the probability of observing an allele frequency difference ≥ 0.15 is extremely small (<10^-10). Thus, even though we are examining on the order of 10^7 to 10^8 SNP-package combinations, the expected number of SNPs to exceed the threshold of 0.15 by chance is close to 0, rendering SNPs falling beyond that threshold as likely pathologic.
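The order of magnitude of this probability can be checked with a normal approximation. The sketch below is illustrative rather than the authors' calculation: it assumes an autosomal SNP, a package split evenly between males and females, and the worst-case allele frequency of 0.5 (which maximizes sampling variance). The function name and its default arguments are hypothetical.

```python
import math


def prob_sex_freq_diff(threshold=0.15, n_individuals=1000, p=0.5, frac_male=0.5):
    """Two-sided normal-approximation probability that the male and female
    sample allele frequencies differ by at least `threshold` when the true
    frequency p is the same in both sexes."""
    n_male_alleles = 2 * n_individuals * frac_male
    n_female_alleles = 2 * n_individuals * (1 - frac_male)
    # Variance of the difference between two independent sample frequencies.
    variance = p * (1 - p) * (1 / n_male_alleles + 1 / n_female_alleles)
    z = threshold / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2))  # P(|Z| >= z) for a standard normal


prob = prob_sex_freq_diff()
print(f"P(|difference| >= 0.15) is about {prob:.1e}")
print(f"Expected chance exceedances in 1e8 tests: {prob * 1e8:.4f}")
```

Under these assumptions the probability falls below 10^-10, so even 10^8 SNP-package combinations yield far fewer than one expected chance exceedance.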

For the thresholds based on genotype discordance, we chose to use a conservative cutoff of 10% or greater for the genotyping error rate. In comparing two samples, a genotyping error rate of 10% would produce a discordance rate of roughly 20%, since either duplicate sample can produce a genotyping error. Sample discordance is an estimate of true discordance, and the finite sample counts lead to two-sigma confidence bounds on the estimate.
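The relationship between error rate and duplicate discordance, and the two-sigma bounds on an observed discordance, can be illustrated as follows. This is a sketch under simple assumptions: errors in the two duplicates are independent, the first-order approximation (discordance is about twice the error rate) is used, and the bounds come from a binomial standard error. The function names and the example counts are hypothetical, and how the bounds entered the actual filtering rule is not specified in the text.

```python
import math


def expected_discordance(error_rate):
    """First-order duplicate discordance: either sample may carry the error."""
    return 2 * error_rate


def two_sigma_bounds(n_discordant, n_compared):
    """Two-sigma binomial bounds on an observed discordance proportion."""
    d = n_discordant / n_compared
    se = math.sqrt(d * (1 - d) / n_compared)
    return max(0.0, d - 2 * se), min(1.0, d + 2 * se)


print(expected_discordance(0.10))   # 0.2
print(two_sigma_bounds(20, 100))    # hypothetical counts: 20 of 100 pairs differ
```

With only 100 compared genotype pairs, an observed discordance of 20% is compatible with a true value anywhere from about 12% to 28%, which is why the finite number of duplicate pairs matters when applying the cutoff.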

Some important SNPs were interrogated with two probe sets to improve the chance for reliable results: one whose sequence was taken from the forward strand and one whose sequence was taken from the reverse strand. For SNPs with multiple probe sets that survived filtering, the best-performing probe set in terms of call rate was chosen to represent the SNP, and the other was filtered out.
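The selection rule for SNPs with more than one surviving probe set amounts to a per-SNP maximum over call rate. A minimal Python sketch follows; the function name, record layout, and identifiers are hypothetical, and the tie-breaking behavior (keep the first probe set seen) is an assumption not stated in the text.

```python
def pick_best_probesets(probesets):
    """probesets: iterable of (snp_id, probeset_id, call_rate) tuples that
    survived filtering. Returns {snp_id: probeset_id}, keeping the probe set
    with the highest call rate for each SNP (first one wins on ties)."""
    best = {}
    for snp_id, probeset_id, call_rate in probesets:
        if snp_id not in best or call_rate > best[snp_id][1]:
            best[snp_id] = (probeset_id, call_rate)
    return {snp_id: probeset_id for snp_id, (probeset_id, _) in best.items()}


# Hypothetical identifiers and call rates, for illustration only.
surviving = [
    ("snp1", "snp1_forward", 0.991),
    ("snp1", "snp1_reverse", 0.997),
    ("snp2", "snp2_forward", 0.985),
]
print(pick_best_probesets(surviving))
```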

Results

DNA quality and call rates

We examined a number of factors that could influence DNA quality and genotyping efficiency, including seasonality of specimen collection (by month), duration of specimen storage prior to DNA extraction, and age of the study participant. By month, mean DNA concentration ranged from a low of 115.16 (SD 75.09) in the month of March to a high of 147.04 (SD 87.66) in the month of July. DNA concentrations were lower in the winter and spring months of December to May and higher in the summer and fall months of June to November (Figure S6). By ANOVA, the amount of variance in DNA concentration accounted for by month of collection was 1.5% (F = 157.59, df = 11, P < 0.0001). Genotyping call rates (CR1) followed a parallel seasonal pattern (Figure S7), with the lowest mean call rate occurring in the month of December (99.34%, SD 0.50) and the highest in the month of June (99.50%, SD 0.46). By ANOVA, the month of sample collection explained 1.4% of the variance in CR1 (F = 136.62, df = 11, P < 0.0001).

There was a modest but statistically significant inverse association of length of sample storage with DNA concentration (F = 48.13, df = 1, P < 0.0001), but it explained only 0.04% of the variance in DNA concentration (Figure S8). There was a stronger negative association between length of storage and CR1 (F = 3890.9, df = 1, P < 0.0001) that accounted for 3.64% of the variance in call rates (Figure S9). The relationship of storage time with both DNA concentration and CR1 appeared to be nonlinear, however, in that the shortest storage periods did not have the highest DNA concentrations or initial call rates.

We noted a positive correlation between age of study subject and DNA concentration (r = 0.120, F = 1603.6, df = 1, P < 0.0001) that explained 1.4% of the variance (Figure S10). There was a similarly positive but weaker correlation between subject age and CR1 (r = 0.033, F = 112.36, df = 1, P < 0.0001) that explained 0.1% of the variance (Figure S11).

We also examined the relationship of DNA concentration with CR1. Except at the extremes of DNA concentration, sample call rate was fairly independent of DNA concentration (Figure S12), although there was a slight negative trend overall that was statistically significant (P < 0.0001) and explained 0.1% of the variance of CR1.

Relationship between DQC and CR1

We observed a moderate positive relationship between the two QC criteria, DQC and CR1, for both Axiom 1.0 and Axiom 2.0 assays (Figure 2). The figure also illustrates the continuous nature of both measures and the possibility of relaxing the DQC threshold to rescue some Axiom 1.0 samples with passing CR1 values but failing DQC scores. Differences between Axiom 1.0 and Axiom 2.0 are also quite apparent. While Axiom 2.0 produced higher DQC scores overall than did Axiom 1.0, there is a stronger positive correlation between DQC and CR1 with Axiom 2.0 (by linear regression, adjusted R2 = 0.468, P < 0.0001) than with Axiom 1.0 (by linear regression, adjusted R2 = 0.160, P < 0.0001). Comparing specifically the results for EUR arrays run with the Axiom 1.0 assay vs. the Axiom 2.0 assay, the DQC passing rate for Axiom 1.0 was 95.1% vs. 98.0% for Axiom 2.0. The CR1 passing rate for Axiom 1.0 was 98.6% vs. 97.8% for Axiom 2.0. Overall, while the CR1 results for Axiom 1.0 are superior, they are outweighed by the higher performance on DQC for Axiom 2.0, so that the total passing rate for Axiom 1.0 was 93.8% vs. 95.8% for Axiom 2.0. Note that this was true even though the EUR array was designed for SNPs validated on the Axiom 1.0 assay.

The distribution of DQC and CR1 by plate over the entire experiment also shows some trends (Figure 3, A and B, respectively). The higher DQC scores with Axiom 2.0 vs. Axiom 1.0 are apparent (mean for Axiom 2.0 = 0.962, mean for Axiom 1.0 = 0.929, t-test = 16266, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely, lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75%, mean for Axiom 1.0 = 99.43%, t-test = 27374, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the GeneTitans. This behavior also demonstrates why the real-time QC was critical to maintain a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific, which eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 10.62, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above) both for the original plate-based genotype calling (CR1) and package-based genotype calling for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21.816, P < 0.0001).
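Genotype discordance for a duplicate pair, as defined in the Figure 5 legend (the proportion of genotype pairs called on both samples that differ), can be computed as in the sketch below. The genotype coding (0, 1, 2 copies of the B allele, with None for a no-call) and the function name are assumptions made for illustration.

```python
def duplicate_discordance(calls_a, calls_b, no_call=None):
    """Compare two genotype vectors from a duplicate pair.

    Genotypes are coded as B-allele counts (0, 1, 2); `no_call` marks a
    missing call. Only positions called in both samples are compared.
    Returns None when no position is called in both samples.
    """
    compared = single = double = 0
    for a, b in zip(calls_a, calls_b):
        if a == no_call or b == no_call:
            continue
        compared += 1
        diff = abs(a - b)
        if diff == 1:
            single += 1   # single-allele difference, e.g. AA vs AB
        elif diff == 2:
            double += 1   # two-allele difference, AA vs BB
    if compared == 0:
        return None
    return {
        "compared": compared,
        "single_allele": single,
        "two_allele": double,
        "discordance": (single + double) / compared,
    }


# Hypothetical calls for one duplicate pair.
sample_1 = [0, 1, 2, None, 1, 0]
sample_2 = [0, 2, 0, 1, 1, None]
print(duplicate_discordance(sample_1, sample_2))
```

In this toy pair, four positions are called in both samples and two of them differ (one single-allele and one two-allele difference), giving a discordance of 0.5.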

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF below 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotype calling using custom priors vs. plate-based genotype calling using standard Affymetrix priors for a subset of LAT-2.0 plates.

It is also seen in Figure 6 that, for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance, and thus improve reproducibility of genotypes, for SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann–Whitney U-tests (for MAF above 0.10, plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for MAF between 0.05 and 0.10, plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for MAF between 0.01 and 0.05, plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF below 0.01, plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 9384 overallwith slightly higher rates for individuals run on the AFRarray (9545) and EAS array (9508) compared to thoserun on the EUR (9385) and LAT (9208) arrays (Table2) In terms of SNP results (Table 1) the arrays with a sub-stantial proportion of single-tiled SNPs had slightly loweroverall SNP success rates (9827 for AFR and 9809 forLAT) than those with most or all SNPs tiled twice (9936for EAS and 9942 for EUR)

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs foreach of the four arrays is provided in Figure 7 All arraysshow an approximate uniform frequency for MAF 010but a clear excess of SNPs with MAF 010 The AFR andLAT have the highest proportion of low MAF SNPs and the

EUR array the lowest The reason for this difference lies inthe design of the arrays (Hoffmann et al 2011ab) Becauseindividuals with mixed genetic ancestry were analyzed onthe LAT AFR and EAS arrays coverage of low-frequencyvariants in more than one ancestral group was attempted forthese arrays as opposed to the EUR array which was basedsolely on European ancestry

Discussion

Using standard QC criteria in this saliva-based DNA geno-typing experiment we achieved a high overall success rateboth for SNPs and samples (103067 individuals successfullygenotyped of 109837 assayed) Still that left 6770 ldquofailedrdquosamples that would not be usable for GWAS or other anal-yses When a sample fails the sample call rate criterion ofCR1 97 it does not mean that all genotypes in the sam-ple are inherently unreliable Typically some probe sets clus-ter poorly and others cluster well even for substandardsamples One strategy would be to kill the poorest perform-ing probe sets to increase the number of passing samplesHowever this approach creates a tradeoff between eliminat-ing poor SNPs and recovering previously failing samples Onthe one hand more samples are accepted but at the ex-pense of accepting fewer SNPs

While the DQC contrast measure is generally a goodpredictor of CR1 and thus sample success the correlationbetween them is not perfect (Figure 2) and has been shownto depend on factors such as the reagent the DNA sourceand the hybridization time By relaxing the DQC thresholdfrom 082 to 075 for example nominally failing samplesmay be recovered However at the same time it is likely thatthe recovered samples will have more genotyping errorsthan those passing the original higher DQC threshold

It is also possible to identify and remediate poorly perform-ing probe sets For example a cluster split is a problem that

Figure 5 Cumulative distribution of genotype discordance across all 828duplicate sample pairs assayed on the EUR array for package-based vsplate-based genotype calling Genotype discordance is defined for a pairof duplicate samples as the proportion of genotype pairs called on bothsamples that differ from each other Difference can include single alleledifferences (more common) or two allele differences (less common)

Figure 6 Cumulative distributions of genotype discordance rate perMAP grouped by MAF for EUR duplicate sample pairs The solid curvesshow discordance for package-generated genotypes and the dashedcurves show discordance for plate-generated genotypes

1058 M N Kvale et al

occurs when an apparently continuous intensity cluster is splitby APT into two or three different genotype clusters Thistypically happens when the cluster is not well approximated bya Gaussian probability distribution This was the main sourceof error found in the final genotype calling Because clustersplits often end up creating some type of aberrant pattern suchas large allele frequency variation between packages or poorreproducibility most cluster splits were filtered out by the post-processing pipeline But by altering the cluster separation(CSep) parameters in the APT clustering algorithm it ispossible to bias the likelihood function so that cluster splitsare less likely to occur Figure S13 shows a probe set clusteringin which a cluster split has occurred and Figure S14 shows thesame probe set in which the CSep parameter has been in-creased preventing the cluster split Altering the CSep param-eters for all probe sets is not recommended however becauseit also has the adverse effect of coalescing truly separate clus-ters Thus to effect this cure it is necessary to distinguish probesets that have cluster splits from those that do not One ap-proach is to create a support vector machine (SVM) classifier(Chang and Chih-Jen 2011) that can effectively discriminatecluster split probe sets from those that have no cluster splits

Another kind of problem that appears in genotyping isthe phenomenon of blemished SNPs Blemished SNPs occurwhen sample assays contain array artifacts The APTsoftware automatically masks out these artifacts generatingno-calls in the process The masking algorithm can at timesbe too aggressive however and can mask out too manygenotypes This has the effect of creating no-calls that haveintensity profiles within otherwise called genotype clusters

One solution to this problem is provided by recalling thewithin-cluster no-called genotypes using an altered APTalgorithm A simpler pragmatic alternative is to create theconvex hulls for each called cluster and to test if no-calls arecontained in any of the hulls if they are then they areassigned the genotype of that convex hull The convex hull isa conservative estimate of the true extent of a cluster so thatthe converted no-calls would always be calls in the absenceof artifact processing

As a consequence of our experience with the Axiom platformduring this experiment Affymetrix was able to improve theassay For example in response to observations provided byroutine monitoring of ldquomontagesrdquo during this study Affymetrixdeveloped and released new fluidics and imaging protocols forthe Axiom 20 assay in the GeneTitan MC instrument The newprotocol greatly attenuates unexpected noise in the images androutine monitoring of plate montages may no longer be re-

quired provided that plates are passing the ldquoplate QCrdquo tests(Affymetrix Axiom Analysis Guides 2015) Regarding probeperformance variance due to the number of probe set replicateson the array and other factors Affymetrix now recommendsusing a common core set of 150K probe sets for ldquosample QCrdquopurposes (Affymetrix Axiom Analysis Guides 2015)

In conclusion performing a large-scale genotyping exper-iment over an extended period of time required both real-time quality assurance and quality control to detect andcorrect problems and maintain consistent performance of theassay Such measures are effective in correcting short-termproblems but over the longer term gradual changes in theassay must also be controlled Use of duplicate pairs ofsamples across the whole assay allowed for detection ofchanges that affected genotype reproducibility By partition-ing the sample set into packages of samples assayed undersimilar conditions genotype reproducibility was improvedand sample success rate was increased After genotypecalling a conservative filtering pipeline was implemented toremove data considered too poor to be of use to downstreamusers Continuing work on improving identification of poorlyperforming SNPs may enable us to include more SNPs andmore samples for future downstream analyses

Acknowledgments

We thank Judith A Millar for her skillful efforts inmanuscript preparation for submission We are gratefulto the Kaiser Permanente Northern California members whohave generously agreed to participate in the Kaiser PermanenteResearch Program on Genes Environment and Health The titleof the dbGaP Parent study is The Kaiser Permanente ResearchProgram on Genes Environment and Health (RPGEH) GOProject IRB CN-09CScha-06-H This work was supported bygrant RC2 AG036607 from the National Institutes of HealthParticipant enrollments survey and sample collection for theRPGEH were supported by grants from the Robert Wood

Table 2 Number of samples assayed and passing QC by array

Array Assayed Passing QC of all passed

samples of samples

passing

EUR 84430 79236 7688 9385EAS 8043 7647 742 9508AFR 5779 5516 535 9545LAT 11585 10668 1035 9208Total 109837 103067 1000 9384

Figure 7 Cumulative distribution of SNP MAF across all passing samplesfor each array

Genome-Wide Genotyping of GERA Cohort 1059

Johnson Foundation the Ellison Medical Foundation the Wayneand Gladys Valley Foundation and Kaiser PermanenteInformation about data access can be obtained at httpwwwncbinlmnihgovprojectsgapcgi-binstudycgistudy_id=phs000674v1p1 and httpsrpgehportalkaiserorg

Note added in proof See Lapham et al 2015 (pp 1061ndash1072) and Banda et al 2015 (pp 1285ndash1295) in this issuefor related works

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray

GENETICS Supporting Information

www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale, Stephanie Hesselson, Thomas J. Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A. Croen, Brad P. Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K. Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P. Quesenberry Jr., Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A. Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch

Copyright © 2015 by the Genetics Society of America. DOI: 10.1534/genetics.115.178905

Figure S1 Distribution of GERA saliva sample collection dates by month and year. Axes: Month and Year (x, Jul-08 to Jan-11), Number (y).

Figure S2 Distribution of saliva sample storage times prior to processing, in months. Axes: Months (x), Number (y).

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the problem in the upper right due to an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays.

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample.

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red.

Figure S6 Mean DNA concentration by month of the year. Axes: Month (x, Jan to Dec), Mean DNA Concentration (y).

Figure S7 Mean initial call rate (CR1) by month of the year. Axes: Month (x, Jan to Dec), Mean Call Rate (y).

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. Axes: Days of Storage (x, <250 to 750–850), Mean DNA Concentration (y).

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. Axes: Days of Storage (x, <250 to 750–850), Mean Call Rate (y).

Figure S10 Mean DNA concentration by age of study subject. Axes: Age (x), Mean DNA Concentration (y).

Figure S11 Mean initial call rate (CR1) by age of study subject. Axes: Age (x), Mean Call Rate (y).

Figure S12 Mean initial call rate (CR1) by DNA concentration. Axes: DNA Concentration (x), Mean Call Rate (y).

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster.

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters.


for Axiom 1.0 = 0.929, t-test = 16266, P < 0.0001), as is a cyclical pattern during the processing of EUR with Axiom 1.0 (Figure 3A). By contrast, CR1 shows the opposite pattern (Figure 3B), namely lower CR1 scores with Axiom 2.0 compared to Axiom 1.0 (mean for Axiom 2.0 = 98.75, mean for Axiom 1.0 = 99.43, t-test = 27374, P < 0.0001). These results are also consistent with the pattern observed in Figure 2. Figure 3 also shows a few episodes of badly performing plates, which were primarily due to a performance problem with one of the GeneTitans. This behavior also demonstrates why the real-time QC was critical to maintain a high level of performance.

Package-based vs. plate-based genotype calling and duplicate reproducibility

Initial plate-based genotype calling was performed in real time and was used for QC assessment (CR1). However, package-based genotyping was found to have several advantages over single plate-based genotyping: (1) use of generic weak cluster location priors that were package specific and eliminated the need for prespecified priors that might bias results; (2) improved detection of rare clusters (genotypes) by including larger numbers of individuals in a single analysis; (3) improved genotype calls by grouping of plates assayed under similar conditions; and (4) improved genotype concordance in duplicate pairs of samples. Therefore, final genotypes were derived from the package-based genotype calling.

To evaluate the improvement due to genotype calling in packages by grouping plates assayed under similar conditions (items 1 and 3 above), we first examined plate medians for package-based genotype calling with custom priors vs. plate-based genotype calling using standard Affymetrix priors (Figure 4). A sizeable improvement in overall call rates can be observed, with an average increase of 1% in call rate that is statistically significant (t-test = 1062, P < 0.0001).

As a second evaluation, we calculated duplicate genotype discordance (item 4 above), both for the original plate-based genotype calling (CR1) and for package-based genotype calling, for the 828 duplicate samples on the EUR array (Figure 5). The partition of the cohort into packages based on experimental factors and using custom priors significantly increased the overall concordance. For example, median discordance decreased from 0.6% to 0.3% with package-based genotype calling, a difference that is statistically significant (comparing plate- vs. package-based discordance by paired t-test, t-test = 21816, P < 0.0001).
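The discordance measure used for duplicate pairs (the proportion of genotype pairs called on both samples that differ) can be illustrated with a minimal Python sketch. The function name, the use of `None` for a no-call, and the toy genotype lists are illustrative assumptions, not part of the original pipeline.

```python
from typing import Optional, Sequence


def genotype_discordance(
    sample_a: Sequence[Optional[str]], sample_b: Sequence[Optional[str]]
) -> float:
    """Proportion of SNPs called on BOTH samples whose genotypes differ.

    Genotypes are strings such as "AA", "AB", "BB"; None marks a no-call
    (an assumed encoding for this sketch).
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("duplicate samples must cover the same SNPs")
    # Keep only SNPs where both members of the pair received a call.
    both_called = [
        (a, b) for a, b in zip(sample_a, sample_b) if a is not None and b is not None
    ]
    if not both_called:
        raise ValueError("no SNP was called on both samples")
    differing = sum(1 for a, b in both_called if a != b)
    return differing / len(both_called)


# Toy duplicate pair: 4 SNPs are called on both samples, 1 of them differs.
original = ["AA", "AB", "BB", None, "AA", "AB"]
duplicate = ["AA", "AA", "BB", "AB", None, "AB"]
print(genotype_discordance(original, duplicate))  # 0.25
```

SNPs with a no-call on either sample are excluded from the denominator, so the measure reflects disagreement between calls rather than missingness.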

To investigate the effect of MAF on SNP discordance in duplicate samples, we evaluated the genotype discordance rate per MAP as a function of MAF, comparing plate-generated genotypes and package-generated genotypes (Figure 6). For both plate and package genotypes, the discordance per MAP increases as SNP MAF decreases, with the greatest discordance seen in SNPs with MAF < 0.01. This reflects the difficulty of genotyping rare SNPs with dominant major homozygous clusters.
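As a small illustration of the MAF grouping used in this comparison, the following Python sketch computes a SNP's minor allele frequency from genotype counts and assigns it to one of four MAF classes. The function names are hypothetical, and the treatment of values falling exactly on a class boundary is an assumption made for this sketch.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Minor allele frequency from counts of AA, AB, and BB genotype calls."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("at least one called genotype is required")
    freq_b = (n_ab + 2 * n_bb) / total_alleles
    # The minor allele is whichever of A and B is less frequent.
    return min(freq_b, 1.0 - freq_b)


def maf_class(maf: float) -> str:
    """Assign a SNP to a MAF class (boundary handling is an assumption)."""
    if maf < 0.01:
        return "MAF < 0.01"
    if maf < 0.05:
        return "0.01-0.05"
    if maf < 0.10:
        return "0.05-0.10"
    return "MAF >= 0.10"


# 90 AA, 9 AB, 1 BB: the B allele has frequency (9 + 2) / 200 = 0.055.
maf = minor_allele_frequency(90, 9, 1)
print(round(maf, 3), maf_class(maf))
```

Discordance values computed per SNP can then be pooled within each class to compare plate-generated and package-generated genotypes.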

Figure 3 Distribution of (A) DQC and (B) first-pass call rate (CR1) for each (96-well) Axiom plate assayed in the experiment. Black bars indicate plate median and red error bars are ±1 median absolute deviation. Blue lines indicate boundaries between array types. For example, EUR-1.0 indicates array type EUR and Axiom 1.0 assay; EUR-2.0 indicates array EUR with Axiom 2.0 assay.

Figure 4 Plate medians of sample call rates for package-based genotype calling using custom priors vs. plate-based genotype calling using standard Affymetrix priors for a subset of LAT-2.0 plates.

It is also seen in Figure 6 that, for each MAF class, the discordance per MAP is higher for the plate-generated genotypes than for the package-generated genotypes. The gap between plate and package discordance increases with decreasing MAF. While package-based genotyping can decrease discordance, and thus improve reproducibility of genotypes of SNPs of all MAFs, it improves reproducibility most dramatically for rare SNPs (item 2 above). All the differences in discordance for plate- vs. package-generated genotypes are highly statistically significant by Mann–Whitney U-tests (for MAF > 0.10, plate mean = 0.0216, package mean = 0.0070, P < 0.0001; for 0.05 < MAF < 0.10, plate mean = 0.0593, package mean = 0.0251, P < 0.0001; for 0.01 < MAF < 0.05, plate mean = 0.113, package mean = 0.0477, P < 0.0001; for MAF < 0.01, plate mean = 0.445, package mean = 0.167, P < 0.0001).

Overall sample and SNP success rates

The sample genotyping success rate was 93.84% overall, with slightly higher rates for individuals run on the AFR array (95.45%) and EAS array (95.08%) compared to those run on the EUR (93.85%) and LAT (92.08%) arrays (Table 2). In terms of SNP results (Table 1), the arrays with a substantial proportion of single-tiled SNPs had slightly lower overall SNP success rates (98.27% for AFR and 98.09% for LAT) than those with most or all SNPs tiled twice (99.36% for EAS and 99.42% for EUR).
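The sample success rates quoted above follow directly from the assayed and passing counts in Table 2. The short Python sketch below reproduces the arithmetic; the dictionary layout and function name are illustrative choices, while the counts are those reported in Table 2.

```python
# (assayed, passing QC) per array, taken from Table 2.
COUNTS = {
    "EUR": (84430, 79236),
    "EAS": (8043, 7647),
    "AFR": (5779, 5516),
    "LAT": (11585, 10668),
}


def success_rates(counts):
    """Return {array: (% of all passed samples, % of samples passing)}."""
    total_assayed = sum(assayed for assayed, _ in counts.values())
    total_passed = sum(passed for _, passed in counts.values())
    rows = {}
    for array, (assayed, passed) in counts.items():
        share_of_passed = round(100 * passed / total_passed, 2)
        pass_rate = round(100 * passed / assayed, 2)
        rows[array] = (share_of_passed, pass_rate)
    rows["Total"] = (100.0, round(100 * total_passed / total_assayed, 2))
    return rows


for array, (share, rate) in success_rates(COUNTS).items():
    print(f"{array:5s} {share:6.2f} {rate:6.2f}")
```

The totals sum to 109,837 assayed and 103,067 passing samples, which gives the overall success rate of 93.84%.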

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs for each of the four arrays is provided in Figure 7. All arrays show an approximately uniform frequency for MAF > 0.10 but a clear excess of SNPs with MAF < 0.10. The AFR and LAT arrays have the highest proportion of low-MAF SNPs and the EUR array the lowest. The reason for this difference lies in the design of the arrays (Hoffmann et al. 2011a,b). Because individuals with mixed genetic ancestry were analyzed on the LAT, AFR, and EAS arrays, coverage of low-frequency variants in more than one ancestral group was attempted for these arrays, as opposed to the EUR array, which was based solely on European ancestry.

Discussion

Using standard QC criteria in this saliva-based DNA genotyping experiment, we achieved a high overall success rate, both for SNPs and samples (103,067 individuals successfully genotyped of 109,837 assayed). Still, that left 6770 "failed" samples that would not be usable for GWAS or other analyses. When a sample fails the sample call rate criterion of CR1 ≥ 97%, it does not mean that all genotypes in the sample are inherently unreliable. Typically, some probe sets cluster poorly and others cluster well, even for substandard samples. One strategy would be to kill the poorest performing probe sets to increase the number of passing samples. However, this approach creates a tradeoff between eliminating poor SNPs and recovering previously failing samples: more samples are accepted, but at the expense of accepting fewer SNPs.
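The tradeoff between dropping probe sets and recovering samples can be made concrete with a toy Python sketch. It ranks probe sets by their number of no-calls, excludes the worst ones, and recomputes which samples pass a call rate threshold. The function names, the `None` no-call encoding, and the toy threshold of 0.8 (chosen so that four probe sets suffice; the study criterion was a 97% call rate) are assumptions for illustration only.

```python
def worst_probe_sets(calls_by_sample, n):
    """Indices of the n probe sets with the most no-calls across samples."""
    n_probes = len(next(iter(calls_by_sample.values())))
    no_calls = [
        sum(calls[i] is None for calls in calls_by_sample.values())
        for i in range(n_probes)
    ]
    ranked = sorted(range(n_probes), key=lambda i: no_calls[i], reverse=True)
    return frozenset(ranked[:n])


def passing_samples(calls_by_sample, excluded=frozenset(), threshold=0.97):
    """Samples whose call rate over the retained probe sets meets the threshold."""
    passing = []
    for sample, calls in calls_by_sample.items():
        kept = [c for i, c in enumerate(calls) if i not in excluded]
        if kept and sum(c is not None for c in kept) / len(kept) >= threshold:
            passing.append(sample)
    return passing


# Toy data: 3 samples x 4 probe sets; None is a no-call.
calls = {
    "s1": ["AA", "AB", None, "BB"],
    "s2": ["AA", "AA", None, None],
    "s3": ["AB", "BB", "AA", "AA"],
}
before = passing_samples(calls, threshold=0.8)
dropped = worst_probe_sets(calls, 1)
after = passing_samples(calls, excluded=dropped, threshold=0.8)
print(before, sorted(dropped), after)  # ['s3'] [2] ['s1', 's3']
```

In this toy case, removing one probe set recovers sample s1 but leaves only three of the four probe sets in the final data, which is the tradeoff described above.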

While the DQC contrast measure is generally a good predictor of CR1, and thus sample success, the correlation between them is not perfect (Figure 2) and has been shown to depend on factors such as the reagent, the DNA source, and the hybridization time. By relaxing the DQC threshold from 0.82 to 0.75, for example, nominally failing samples may be recovered. However, at the same time, it is likely that the recovered samples will have more genotyping errors than those passing the original, higher DQC threshold.

Figure 5 Cumulative distribution of genotype discordance across all 828 duplicate sample pairs assayed on the EUR array for package-based vs. plate-based genotype calling. Genotype discordance is defined for a pair of duplicate samples as the proportion of genotype pairs called on both samples that differ from each other. Difference can include single allele differences (more common) or two allele differences (less common).

Figure 6 Cumulative distributions of genotype discordance rate per MAP, grouped by MAF, for EUR duplicate sample pairs. The solid curves show discordance for package-generated genotypes and the dashed curves show discordance for plate-generated genotypes.

It is also possible to identify and remediate poorly performing probe sets. For example, a cluster split is a problem that occurs when an apparently continuous intensity cluster is split by APT into two or three different genotype clusters. This typically happens when the cluster is not well approximated by a Gaussian probability distribution. This was the main source of error found in the final genotype calling. Because cluster splits often end up creating some type of aberrant pattern, such as large allele frequency variation between packages or poor reproducibility, most cluster splits were filtered out by the postprocessing pipeline. But by altering the cluster separation (CSep) parameters in the APT clustering algorithm, it is possible to bias the likelihood function so that cluster splits are less likely to occur. Figure S13 shows a probe set clustering in which a cluster split has occurred, and Figure S14 shows the same probe set in which the CSep parameter has been increased, preventing the cluster split. Altering the CSep parameters for all probe sets is not recommended, however, because it also has the adverse effect of coalescing truly separate clusters. Thus, to effect this cure, it is necessary to distinguish probe sets that have cluster splits from those that do not. One approach is to create a support vector machine (SVM) classifier (Chang and Lin 2011) that can effectively discriminate cluster split probe sets from those that have no cluster splits.

Another kind of problem that appears in genotyping is the phenomenon of blemished SNPs. Blemished SNPs occur when sample assays contain array artifacts. The APT software automatically masks out these artifacts, generating no-calls in the process. The masking algorithm can at times be too aggressive, however, and can mask out too many genotypes. This has the effect of creating no-calls that have intensity profiles within otherwise called genotype clusters.

One solution to this problem is provided by recalling the within-cluster no-called genotypes using an altered APT algorithm. A simpler, pragmatic alternative is to create the convex hulls for each called cluster and to test if no-calls are contained in any of the hulls; if they are, then they are assigned the genotype of that convex hull. The convex hull is a conservative estimate of the true extent of a cluster, so that the converted no-calls would always be calls in the absence of artifact processing.
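The convex hull alternative can be sketched in a few lines of Python. The sketch assumes that each sample at a probe set is summarized by a two-dimensional intensity coordinate; the function names, the monotone-chain hull construction, and the choice to leave a no-call unchanged when it falls inside zero or more than one hull are illustrative assumptions rather than the procedure actually implemented.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate cluster: stay conservative, rescue nothing
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0 for i in range(len(hull))
    )


def rescue_no_calls(called_clusters, no_call_points):
    """Assign each no-call the genotype of the single hull containing it, else None."""
    hulls = {g: convex_hull(pts) for g, pts in called_clusters.items()}
    rescued = []
    for p in no_call_points:
        hits = [g for g, hull in hulls.items() if inside_hull(hull, p)]
        rescued.append(hits[0] if len(hits) == 1 else None)
    return rescued


# Toy clusters in an assumed 2D intensity space.
clusters = {
    "AA": [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)],
    "AB": [(4, 0), (6, 0), (6, 2), (4, 2)],
}
print(rescue_no_calls(clusters, [(1, 0.5), (5, 1), (3, 1)]))  # ['AA', 'AB', None]
```

Because a hull spans only the samples that were actually called, a no-call lying between clusters, such as the third point above, remains a no-call.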

As a consequence of our experience with the Axiom platform during this experiment, Affymetrix was able to improve the assay. For example, in response to observations provided by routine monitoring of "montages" during this study, Affymetrix developed and released new fluidics and imaging protocols for the Axiom 2.0 assay in the GeneTitan MC instrument. The new protocol greatly attenuates unexpected noise in the images, and routine monitoring of plate montages may no longer be required, provided that plates are passing the "plate QC" tests (Affymetrix Axiom Analysis Guides 2015). Regarding probe performance variance due to the number of probe set replicates on the array and other factors, Affymetrix now recommends using a common core set of 150K probe sets for "sample QC" purposes (Affymetrix Axiom Analysis Guides 2015).

In conclusion performing a large-scale genotyping exper-iment over an extended period of time required both real-time quality assurance and quality control to detect andcorrect problems and maintain consistent performance of theassay Such measures are effective in correcting short-termproblems but over the longer term gradual changes in theassay must also be controlled Use of duplicate pairs ofsamples across the whole assay allowed for detection ofchanges that affected genotype reproducibility By partition-ing the sample set into packages of samples assayed undersimilar conditions genotype reproducibility was improvedand sample success rate was increased After genotypecalling a conservative filtering pipeline was implemented toremove data considered too poor to be of use to downstreamusers Continuing work on improving identification of poorlyperforming SNPs may enable us to include more SNPs andmore samples for future downstream analyses

Acknowledgments

We thank Judith A Millar for her skillful efforts inmanuscript preparation for submission We are gratefulto the Kaiser Permanente Northern California members whohave generously agreed to participate in the Kaiser PermanenteResearch Program on Genes Environment and Health The titleof the dbGaP Parent study is The Kaiser Permanente ResearchProgram on Genes Environment and Health (RPGEH) GOProject IRB CN-09CScha-06-H This work was supported bygrant RC2 AG036607 from the National Institutes of HealthParticipant enrollments survey and sample collection for theRPGEH were supported by grants from the Robert Wood

Table 2 Number of samples assayed and passing QC by array

Array Assayed Passing QC of all passed

samples of samples

passing

EUR 84430 79236 7688 9385EAS 8043 7647 742 9508AFR 5779 5516 535 9545LAT 11585 10668 1035 9208Total 109837 103067 1000 9384

Figure 7 Cumulative distribution of SNP MAF across all passing samplesfor each array

Genome-Wide Genotyping of GERA Cohort 1059

Johnson Foundation the Ellison Medical Foundation the Wayneand Gladys Valley Foundation and Kaiser PermanenteInformation about data access can be obtained at httpwwwncbinlmnihgovprojectsgapcgi-binstudycgistudy_id=phs000674v1p1 and httpsrpgehportalkaiserorg

Note added in proof See Lapham et al 2015 (pp 1061ndash1072) and Banda et al 2015 (pp 1285ndash1295) in this issuefor related works

Literature Cited

Affymetrix Axiom Analysis Guides 2015 httpwwwaffymetrixcomsupportdownloadsmanualsaxiom_genotyping_solution_analysis_guidepdf

Affymetrix Genotyping Protocol 2015 Axiom 20 Assay AutomatedWorkflow httpmediaaffymetrixcomsupportdownloadsmanualsaxiom_2_assay_auto_workflow_user_guidepdf

Affymetrix White Paper 2009 httpmediaaffymetrixcomsupporttechnicaldatasheetsaxiom_genotyping_solution_datasheetpdf

Chang C-C and C-J Lin 2011 LIBSVM a library for supportvector machines ACM Trans Intell Syst Technol 2 1ndash27

Enger S M S K Van den Eeden B Sternfeld R K Loo C PQuesenberry Jr et al 2006 California Menrsquos Health Study(CMHS) a multiethnic cohort in a managed care setting BMCPublic Health 6 172

Hoffmann T J M N Kvale S E Hesselson Y Zhan C Aquinoet al 2011a Next generation genome-wide association tooldesign and coverage of a high-throughput European-optimizedSNP array Genomics 98 79ndash89

Hoffmann T J Y Zhan M N Kvale S E Hesselson J Gollubet al 2011b Design and coverage of high throughput genotyp-ing arrays optimized for individuals of East Asian African Amer-ican and Latino raceethnicity using imputation and a novelhybrid SNP selection algorithm Genomics 98 422ndash430

Communicating editor N R Wray

1060 M N Kvale et al

GENETICSSupporting Information

wwwgeneticsorglookupsuppldoi101534genetics115178905-DC1

Genotyping Informatics and Quality Control for100000 Subjects in the Genetic Epidemiology

Research on Adult Health and Aging (GERA) CohortMark N Kvale Stephanie Hesselson Thomas J Hoffmann Yang Cao David Chan

Sheryl Connell Lisa A Croen Brad P Dispensa Jasmin Eshragh Andrea Finn Jeremy GollubCarlos Iribarren Eric Jorgenson Lawrence H Kushi Richard Lao Yontao Lu Dana Ludwig

Gurpreet K Mathauda William B McGuire Gangwu Mei Sunita Miles Michael MittmanMohini Patil Charles P Quesenberry Jr Dilrini Ranatunga Sarah Rowell Marianne Sadler

Lori C Sakoda Michael Shapero Ling Shen Tanu Shenoy David Smethurst Carol P SomkinStephen K Van Den Eeden Lawrence Walter Eunice Wan Teresa Webster Rachel A Whitmer

Simon Wong Chia Zau Yiping Zhan Catherine Schaefer Pui-Yan Kwok and Neil Risch

Copyright copy 2015 by the Genetics Society of AmericaDOI 101534genetics115178905

Si13 M13 Kvale13 et13 al13 13 213

Figure S1 Distribution of GERA saliva sample collection dates by month and year

013

200013

400013

600013

800013

1000013

1200013

1400013

1600013

Jul-shy‐0

813

Sep-shy‐0813

Nov-shy‐0813

Jan-shy‐0913

Mar-shy‐0913

May-shy‐0913

Jul-shy‐0

913

Sep-shy‐0913

Nov-shy‐0913

Jan-shy‐1013

Mar-shy‐1013

May-shy‐1013

Jul-shy‐1

013

Sep-shy‐1013

Nov-shy‐1013

Jan-shy‐1113

Num

ber13

Month13 and13 Year13

Si13 M13 Kvale13 et13 al13 13 313

Figure S2 Distribution of saliva sample storage times prior to processing in months

013

500013

1000013

1500013

2000013

2500013

3000013

3500013

113 313 513 713 913 1113 1313 1513 1713 1913 2113 2313 2513 2713 2913 3113

Num

ber13

Months13

Si13 M13 Kvale13 et13 al13 13 413

Figure S3 A montage of digital array images for an early Axiom plate assay Note the problem in the upper right due to an assay problem Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays

13

Si13 M13 Kvale13 et13 al13 13 513

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample 13

13

13

Si13 M13 Kvale13 et13 al13 13 613

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset The chosen variance ratio threshold of 31 is shown in red

Si13 M13 Kvale13 et13 al13 13 713

Figure S6 Mean DNA concentration by month of the year

013

2013

4013

6013

8013

10013

12013

14013

16013

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 DN

A13 Co

ncen

tra5

on13

Month13

Si13 M13 Kvale13 et13 al13 13 813

Figure S7 Mean initial call rate (CR1) by month of the year

992513

99313

993513

99413

994513

99513

995513

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 Ca

ll13 Ra

te13

Month13

Si13 M13 Kvale13 et13 al13 13 913

Figure S8 Mean DNA concentration by saliva sample storage time in 100 day intervals

11513

12013

12513

13013

13513

14013

14513

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 DN

A13 Co

ncen

tra5

on13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1013

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100 day intervals

98913

9913

99113

99213

99313

99413

99513

99613

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 Ca

ll13 Ra

te13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1113

Figure S10 Mean DNA concentration by age of study subject

013 2013 4013 6013 8013 10013 12013 14013 16013 18013

Mean13 DN

A13 Co

ncen

tra5

on13

Age13

Si13 M13 Kvale13 et13 al13 13 1213

Figure S11 Mean initial call rate (CR1) by age of study subject

99313

993513

99413

994513

99513

995513 Mean13 Ca

ll13 Ra

te13

Age13

Si13 M13 Kvale13 et13 al13 13 1313

Figure S12 Mean initial call rate (CR1) by DNA concentration

993913

99413

994113

994213

994313

994413

994513

994613 Mean13 Ca

ll13 Ra

te13

DNA13 Concentra5on13

Si13 M13 Kvale13 et13 al13 13 1413

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster

13

13

Si13 M13 Kvale13 et13 al13 13 1513

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 015 instead of 010 The altered parameter avoids a cluster split and yields more accurate AA and AB clusters

13

Page 8: Genotyping Informatics and Quality Control for 100,000 ...Affymetrix Axiom Genotyping Solution. Because genotyping took place ov er a short 14-month period, creating a near-real-time

It is also seen in Figure 6 that for each MAF class thediscordance per MAP is higher for the plate-generated geno-types than for the package-generated genotypes The gapbetween plate and package discordance increases with de-creasing MAF While package -based genotyping can de-crease discordance and thus improve reproducibility ofgenotypes of SNPs of all MAFs it improves reproducibilitymost dramatically for rare SNPs (item 2 above) All the dif-ferences in discordance for plate- vs package-generated geno-types are highly statistically significant by MannndashWhitneyU-tests (for MAF 010 plate mean = 00216 packagemean = 00070 P 00001 for 005 MAF 010 platemean = 00593 package mean = 00251 P 00001 for001 MAF 005 plate mean = 0113 package mean =00477 P 00001 for MAF 001 plate mean = 0445package mean = 0167 P 00001)

Overall sample and SNP success rates

The sample genotyping success rate was 9384 overallwith slightly higher rates for individuals run on the AFRarray (9545) and EAS array (9508) compared to thoserun on the EUR (9385) and LAT (9208) arrays (Table2) In terms of SNP results (Table 1) the arrays with a sub-stantial proportion of single-tiled SNPs had slightly loweroverall SNP success rates (9827 for AFR and 9809 forLAT) than those with most or all SNPs tiled twice (9936for EAS and 9942 for EUR)

SNP characteristics by array

The MAF cumulative distribution for all retained SNPs foreach of the four arrays is provided in Figure 7 All arraysshow an approximate uniform frequency for MAF 010but a clear excess of SNPs with MAF 010 The AFR andLAT have the highest proportion of low MAF SNPs and the

EUR array the lowest The reason for this difference lies inthe design of the arrays (Hoffmann et al 2011ab) Becauseindividuals with mixed genetic ancestry were analyzed onthe LAT AFR and EAS arrays coverage of low-frequencyvariants in more than one ancestral group was attempted forthese arrays as opposed to the EUR array which was basedsolely on European ancestry

Discussion

Using standard QC criteria in this saliva-based DNA geno-typing experiment we achieved a high overall success rateboth for SNPs and samples (103067 individuals successfullygenotyped of 109837 assayed) Still that left 6770 ldquofailedrdquosamples that would not be usable for GWAS or other anal-yses When a sample fails the sample call rate criterion ofCR1 97 it does not mean that all genotypes in the sam-ple are inherently unreliable Typically some probe sets clus-ter poorly and others cluster well even for substandardsamples One strategy would be to kill the poorest perform-ing probe sets to increase the number of passing samplesHowever this approach creates a tradeoff between eliminat-ing poor SNPs and recovering previously failing samples Onthe one hand more samples are accepted but at the ex-pense of accepting fewer SNPs

While the DQC contrast measure is generally a goodpredictor of CR1 and thus sample success the correlationbetween them is not perfect (Figure 2) and has been shownto depend on factors such as the reagent the DNA sourceand the hybridization time By relaxing the DQC thresholdfrom 082 to 075 for example nominally failing samplesmay be recovered However at the same time it is likely thatthe recovered samples will have more genotyping errorsthan those passing the original higher DQC threshold

It is also possible to identify and remediate poorly perform-ing probe sets For example a cluster split is a problem that

Figure 5 Cumulative distribution of genotype discordance across all 828duplicate sample pairs assayed on the EUR array for package-based vsplate-based genotype calling Genotype discordance is defined for a pairof duplicate samples as the proportion of genotype pairs called on bothsamples that differ from each other Difference can include single alleledifferences (more common) or two allele differences (less common)

Figure 6 Cumulative distributions of genotype discordance rate perMAP grouped by MAF for EUR duplicate sample pairs The solid curvesshow discordance for package-generated genotypes and the dashedcurves show discordance for plate-generated genotypes

1058 M N Kvale et al

occurs when an apparently continuous intensity cluster is splitby APT into two or three different genotype clusters Thistypically happens when the cluster is not well approximated bya Gaussian probability distribution This was the main sourceof error found in the final genotype calling Because clustersplits often end up creating some type of aberrant pattern suchas large allele frequency variation between packages or poorreproducibility most cluster splits were filtered out by the post-processing pipeline But by altering the cluster separation(CSep) parameters in the APT clustering algorithm it ispossible to bias the likelihood function so that cluster splitsare less likely to occur Figure S13 shows a probe set clusteringin which a cluster split has occurred and Figure S14 shows thesame probe set in which the CSep parameter has been in-creased preventing the cluster split Altering the CSep param-eters for all probe sets is not recommended however becauseit also has the adverse effect of coalescing truly separate clus-ters Thus to effect this cure it is necessary to distinguish probesets that have cluster splits from those that do not One ap-proach is to create a support vector machine (SVM) classifier(Chang and Chih-Jen 2011) that can effectively discriminatecluster split probe sets from those that have no cluster splits

Another kind of problem that appears in genotyping isthe phenomenon of blemished SNPs Blemished SNPs occurwhen sample assays contain array artifacts The APTsoftware automatically masks out these artifacts generatingno-calls in the process The masking algorithm can at timesbe too aggressive however and can mask out too manygenotypes This has the effect of creating no-calls that haveintensity profiles within otherwise called genotype clusters

One solution to this problem is provided by recalling thewithin-cluster no-called genotypes using an altered APTalgorithm A simpler pragmatic alternative is to create theconvex hulls for each called cluster and to test if no-calls arecontained in any of the hulls if they are then they areassigned the genotype of that convex hull The convex hull isa conservative estimate of the true extent of a cluster so thatthe converted no-calls would always be calls in the absenceof artifact processing

As a consequence of our experience with the Axiom platformduring this experiment Affymetrix was able to improve theassay For example in response to observations provided byroutine monitoring of ldquomontagesrdquo during this study Affymetrixdeveloped and released new fluidics and imaging protocols forthe Axiom 20 assay in the GeneTitan MC instrument The newprotocol greatly attenuates unexpected noise in the images androutine monitoring of plate montages may no longer be re-

quired provided that plates are passing the ldquoplate QCrdquo tests(Affymetrix Axiom Analysis Guides 2015) Regarding probeperformance variance due to the number of probe set replicateson the array and other factors Affymetrix now recommendsusing a common core set of 150K probe sets for ldquosample QCrdquopurposes (Affymetrix Axiom Analysis Guides 2015)

In conclusion, performing a large-scale genotyping experiment over an extended period of time required both real-time quality assurance and quality control to detect and correct problems and maintain consistent performance of the assay. Such measures are effective in correcting short-term problems, but over the longer term, gradual changes in the assay must also be controlled. Use of duplicate pairs of samples across the whole assay allowed for detection of changes that affected genotype reproducibility. By partitioning the sample set into packages of samples assayed under similar conditions, genotype reproducibility was improved and sample success rate was increased. After genotype calling, a conservative filtering pipeline was implemented to remove data considered too poor to be of use to downstream users. Continuing work on improving identification of poorly performing SNPs may enable us to include more SNPs and more samples for future downstream analyses.

Acknowledgments

We thank Judith A. Millar for her skillful efforts in manuscript preparation for submission. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment, and Health. The title of the dbGaP Parent study is The Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH), GO Project IRB: CN-09CScha-06-H. This work was supported by grant RC2 AG036607 from the National Institutes of Health. Participant enrollment, survey, and sample collection for the RPGEH were supported by grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, and Kaiser Permanente. Information about data access can be obtained at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1 and https://rpgehportal.kaiser.org.

Figure 7 Cumulative distribution of SNP MAF across all passing samples for each array.

Table 2 Number of samples assayed and passing QC by array

Array    Assayed    Passing QC    % of all passed samples    % of samples passing
EUR       84,430        79,236                      76.88                   93.85
EAS        8,043         7,647                       7.42                   95.08
AFR        5,779         5,516                       5.35                   95.45
LAT       11,585        10,668                      10.35                   92.08
Total    109,837       103,067                      100.0                   93.84
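The two percentage columns of Table 2 follow directly from the assayed and passing counts. The short sketch below recomputes them as a consistency check; the dictionary layout and the function name `summarize` are illustrative choices, and only the counts themselves come from the table.

```python
from typing import Dict, Tuple

# (assayed, passing QC) counts per array, taken from Table 2.
TABLE2_COUNTS: Dict[str, Tuple[int, int]] = {
    "EUR": (84430, 79236),
    "EAS": (8043, 7647),
    "AFR": (5779, 5516),
    "LAT": (11585, 10668),
}


def summarize(counts: Dict[str, Tuple[int, int]]):
    """Return total assayed, total passing, and per-array percentage pairs.

    Each pair is (% of all passed samples, % of samples passing on that array).
    """
    total_assayed = sum(assayed for assayed, _ in counts.values())
    total_passed = sum(passed for _, passed in counts.values())
    rows = {}
    for array, (assayed, passed) in counts.items():
        share_of_passed = round(100.0 * passed / total_passed, 2)
        pass_rate = round(100.0 * passed / assayed, 2)
        rows[array] = (share_of_passed, pass_rate)
    rows["Total"] = (100.0, round(100.0 * total_passed / total_assayed, 2))
    return total_assayed, total_passed, rows


total_assayed, total_passed, rows = summarize(TABLE2_COUNTS)
print(total_assayed, total_passed)
for array, (share, rate) in rows.items():
    print(f"{array:5s} {share:6.2f} {rate:6.2f}")
```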

Note added in proof: See Lapham et al. 2015 (pp. 1061–1072) and Banda et al. 2015 (pp. 1285–1295) in this issue for related works.

Literature Cited

Affymetrix Axiom Analysis Guides, 2015 http://www.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf

Affymetrix Genotyping Protocol, 2015 Axiom 2.0 Assay Automated Workflow. http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf

Affymetrix White Paper, 2009 http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf

Chang, C.-C., and C.-J. Lin, 2011 LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2: 1–27.

Enger, S. M., S. K. Van den Eeden, B. Sternfeld, R. K. Loo, C. P. Quesenberry, Jr. et al., 2006 California Men's Health Study (CMHS): a multiethnic cohort in a managed care setting. BMC Public Health 6: 172.

Hoffmann, T. J., M. N. Kvale, S. E. Hesselson, Y. Zhan, C. Aquino et al., 2011a Next generation genome-wide association tool: design and coverage of a high-throughput European-optimized SNP array. Genomics 98: 79–89.

Hoffmann, T. J., Y. Zhan, M. N. Kvale, S. E. Hesselson, J. Gollub et al., 2011b Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics 98: 422–430.

Communicating editor: N. R. Wray


GENETICS
Supporting Information
www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178905/-/DC1

Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort

Mark N. Kvale, Stephanie Hesselson, Thomas J. Hoffmann, Yang Cao, David Chan, Sheryl Connell, Lisa A. Croen, Brad P. Dispensa, Jasmin Eshragh, Andrea Finn, Jeremy Gollub, Carlos Iribarren, Eric Jorgenson, Lawrence H. Kushi, Richard Lao, Yontao Lu, Dana Ludwig, Gurpreet K. Mathauda, William B. McGuire, Gangwu Mei, Sunita Miles, Michael Mittman, Mohini Patil, Charles P. Quesenberry Jr., Dilrini Ranatunga, Sarah Rowell, Marianne Sadler, Lori C. Sakoda, Michael Shapero, Ling Shen, Tanu Shenoy, David Smethurst, Carol P. Somkin, Stephen K. Van Den Eeden, Lawrence Walter, Eunice Wan, Teresa Webster, Rachel A. Whitmer, Simon Wong, Chia Zau, Yiping Zhan, Catherine Schaefer, Pui-Yan Kwok, and Neil Risch

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.178905

Figure S1 Distribution of GERA saliva sample collection dates by month and year. [Plot not reproduced. Axes: Month and Year (Jul-08 to Jan-11); Number.]

Figure S2 Distribution of saliva sample storage times prior to processing, in months. [Plot not reproduced. Axes: Months; Number.]

Figure S3 A montage of digital array images for an early Axiom plate assay. Note the artifact in the upper right, which was caused by an assay problem. Fast detection of such problems allowed us to fix them in a timely manner and avoid them in later plate assays. [Image not reproduced.]

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample. [Plot not reproduced.]

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset. The chosen variance ratio threshold of 31 is shown in red. [Plot not reproduced.]

Figure S6 Mean DNA concentration by month of the year. [Plot not reproduced. Axes: Month (Jan to Dec); Mean DNA Concentration.]

Figure S7 Mean initial call rate (CR1) by month of the year. [Plot not reproduced. Axes: Month (Jan to Dec); Mean Call Rate.]

Figure S8 Mean DNA concentration by saliva sample storage time in 100-day intervals. [Plot not reproduced. Axes: Days of Storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); Mean DNA Concentration.]

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100-day intervals. [Plot not reproduced. Axes: Days of Storage (<250, 250-350, 350-450, 450-550, 550-650, 650-750, 750-850); Mean Call Rate.]

Figure S10 Mean DNA concentration by age of study subject. [Plot not reproduced. Axes: Age; Mean DNA Concentration.]

Figure S11 Mean initial call rate (CR1) by age of study subject. [Plot not reproduced. Axes: Age; Mean Call Rate.]

Figure S12 Mean initial call rate (CR1) by DNA concentration. [Plot not reproduced. Axes: DNA Concentration; Mean Call Rate.]

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters. The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster. [Plot not reproduced.]

Figure S14 The same probe set as in Figure S13, but with the APT CSepPen parameter set at 0.15 instead of 0.10. The altered parameter avoids a cluster split and yields more accurate AA and AB clusters. [Plot not reproduced.]

Page 9: Genotyping Informatics and Quality Control for 100,000 ...Affymetrix Axiom Genotyping Solution. Because genotyping took place ov er a short 14-month period, creating a near-real-time

occurs when an apparently continuous intensity cluster is splitby APT into two or three different genotype clusters Thistypically happens when the cluster is not well approximated bya Gaussian probability distribution This was the main sourceof error found in the final genotype calling Because clustersplits often end up creating some type of aberrant pattern suchas large allele frequency variation between packages or poorreproducibility most cluster splits were filtered out by the post-processing pipeline But by altering the cluster separation(CSep) parameters in the APT clustering algorithm it ispossible to bias the likelihood function so that cluster splitsare less likely to occur Figure S13 shows a probe set clusteringin which a cluster split has occurred and Figure S14 shows thesame probe set in which the CSep parameter has been in-creased preventing the cluster split Altering the CSep param-eters for all probe sets is not recommended however becauseit also has the adverse effect of coalescing truly separate clus-ters Thus to effect this cure it is necessary to distinguish probesets that have cluster splits from those that do not One ap-proach is to create a support vector machine (SVM) classifier(Chang and Chih-Jen 2011) that can effectively discriminatecluster split probe sets from those that have no cluster splits

Another kind of problem that appears in genotyping isthe phenomenon of blemished SNPs Blemished SNPs occurwhen sample assays contain array artifacts The APTsoftware automatically masks out these artifacts generatingno-calls in the process The masking algorithm can at timesbe too aggressive however and can mask out too manygenotypes This has the effect of creating no-calls that haveintensity profiles within otherwise called genotype clusters

One solution to this problem is provided by recalling thewithin-cluster no-called genotypes using an altered APTalgorithm A simpler pragmatic alternative is to create theconvex hulls for each called cluster and to test if no-calls arecontained in any of the hulls if they are then they areassigned the genotype of that convex hull The convex hull isa conservative estimate of the true extent of a cluster so thatthe converted no-calls would always be calls in the absenceof artifact processing

As a consequence of our experience with the Axiom platformduring this experiment Affymetrix was able to improve theassay For example in response to observations provided byroutine monitoring of ldquomontagesrdquo during this study Affymetrixdeveloped and released new fluidics and imaging protocols forthe Axiom 20 assay in the GeneTitan MC instrument The newprotocol greatly attenuates unexpected noise in the images androutine monitoring of plate montages may no longer be re-

quired provided that plates are passing the ldquoplate QCrdquo tests(Affymetrix Axiom Analysis Guides 2015) Regarding probeperformance variance due to the number of probe set replicateson the array and other factors Affymetrix now recommendsusing a common core set of 150K probe sets for ldquosample QCrdquopurposes (Affymetrix Axiom Analysis Guides 2015)

In conclusion performing a large-scale genotyping exper-iment over an extended period of time required both real-time quality assurance and quality control to detect andcorrect problems and maintain consistent performance of theassay Such measures are effective in correcting short-termproblems but over the longer term gradual changes in theassay must also be controlled Use of duplicate pairs ofsamples across the whole assay allowed for detection ofchanges that affected genotype reproducibility By partition-ing the sample set into packages of samples assayed undersimilar conditions genotype reproducibility was improvedand sample success rate was increased After genotypecalling a conservative filtering pipeline was implemented toremove data considered too poor to be of use to downstreamusers Continuing work on improving identification of poorlyperforming SNPs may enable us to include more SNPs andmore samples for future downstream analyses

Acknowledgments

We thank Judith A Millar for her skillful efforts inmanuscript preparation for submission We are gratefulto the Kaiser Permanente Northern California members whohave generously agreed to participate in the Kaiser PermanenteResearch Program on Genes Environment and Health The titleof the dbGaP Parent study is The Kaiser Permanente ResearchProgram on Genes Environment and Health (RPGEH) GOProject IRB CN-09CScha-06-H This work was supported bygrant RC2 AG036607 from the National Institutes of HealthParticipant enrollments survey and sample collection for theRPGEH were supported by grants from the Robert Wood

Table 2 Number of samples assayed and passing QC by array

Array Assayed Passing QC of all passed

samples of samples

passing

EUR 84430 79236 7688 9385EAS 8043 7647 742 9508AFR 5779 5516 535 9545LAT 11585 10668 1035 9208Total 109837 103067 1000 9384

Figure 7 Cumulative distribution of SNP MAF across all passing samplesfor each array

Genome-Wide Genotyping of GERA Cohort 1059

Johnson Foundation the Ellison Medical Foundation the Wayneand Gladys Valley Foundation and Kaiser PermanenteInformation about data access can be obtained at httpwwwncbinlmnihgovprojectsgapcgi-binstudycgistudy_id=phs000674v1p1 and httpsrpgehportalkaiserorg

Note added in proof See Lapham et al 2015 (pp 1061ndash1072) and Banda et al 2015 (pp 1285ndash1295) in this issuefor related works

Literature Cited

Affymetrix Axiom Analysis Guides 2015 httpwwwaffymetrixcomsupportdownloadsmanualsaxiom_genotyping_solution_analysis_guidepdf

Affymetrix Genotyping Protocol 2015 Axiom 20 Assay AutomatedWorkflow httpmediaaffymetrixcomsupportdownloadsmanualsaxiom_2_assay_auto_workflow_user_guidepdf

Affymetrix White Paper 2009 httpmediaaffymetrixcomsupporttechnicaldatasheetsaxiom_genotyping_solution_datasheetpdf

Chang C-C and C-J Lin 2011 LIBSVM a library for supportvector machines ACM Trans Intell Syst Technol 2 1ndash27

Enger S M S K Van den Eeden B Sternfeld R K Loo C PQuesenberry Jr et al 2006 California Menrsquos Health Study(CMHS) a multiethnic cohort in a managed care setting BMCPublic Health 6 172

Hoffmann T J M N Kvale S E Hesselson Y Zhan C Aquinoet al 2011a Next generation genome-wide association tooldesign and coverage of a high-throughput European-optimizedSNP array Genomics 98 79ndash89

Hoffmann T J Y Zhan M N Kvale S E Hesselson J Gollubet al 2011b Design and coverage of high throughput genotyp-ing arrays optimized for individuals of East Asian African Amer-ican and Latino raceethnicity using imputation and a novelhybrid SNP selection algorithm Genomics 98 422ndash430

Communicating editor N R Wray

1060 M N Kvale et al

GENETICSSupporting Information

wwwgeneticsorglookupsuppldoi101534genetics115178905-DC1

Genotyping Informatics and Quality Control for100000 Subjects in the Genetic Epidemiology

Research on Adult Health and Aging (GERA) CohortMark N Kvale Stephanie Hesselson Thomas J Hoffmann Yang Cao David Chan

Sheryl Connell Lisa A Croen Brad P Dispensa Jasmin Eshragh Andrea Finn Jeremy GollubCarlos Iribarren Eric Jorgenson Lawrence H Kushi Richard Lao Yontao Lu Dana Ludwig

Gurpreet K Mathauda William B McGuire Gangwu Mei Sunita Miles Michael MittmanMohini Patil Charles P Quesenberry Jr Dilrini Ranatunga Sarah Rowell Marianne Sadler

Lori C Sakoda Michael Shapero Ling Shen Tanu Shenoy David Smethurst Carol P SomkinStephen K Van Den Eeden Lawrence Walter Eunice Wan Teresa Webster Rachel A Whitmer

Simon Wong Chia Zau Yiping Zhan Catherine Schaefer Pui-Yan Kwok and Neil Risch

Copyright copy 2015 by the Genetics Society of AmericaDOI 101534genetics115178905

Si13 M13 Kvale13 et13 al13 13 213

Figure S1 Distribution of GERA saliva sample collection dates by month and year

013

200013

400013

600013

800013

1000013

1200013

1400013

1600013

Jul-shy‐0

813

Sep-shy‐0813

Nov-shy‐0813

Jan-shy‐0913

Mar-shy‐0913

May-shy‐0913

Jul-shy‐0

913

Sep-shy‐0913

Nov-shy‐0913

Jan-shy‐1013

Mar-shy‐1013

May-shy‐1013

Jul-shy‐1

013

Sep-shy‐1013

Nov-shy‐1013

Jan-shy‐1113

Num

ber13

Month13 and13 Year13

Si13 M13 Kvale13 et13 al13 13 313

Figure S2 Distribution of saliva sample storage times prior to processing in months

013

500013

1000013

1500013

2000013

2500013

3000013

3500013

113 313 513 713 913 1113 1313 1513 1713 1913 2113 2313 2513 2713 2913 3113

Num

ber13

Months13

Si13 M13 Kvale13 et13 al13 13 413

Figure S3 A montage of digital array images for an early Axiom plate assay Note the problem in the upper right due to an assay problem Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays

13

Si13 M13 Kvale13 et13 al13 13 513

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample 13

13

13

Si13 M13 Kvale13 et13 al13 13 613

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset The chosen variance ratio threshold of 31 is shown in red

Si13 M13 Kvale13 et13 al13 13 713

Figure S6 Mean DNA concentration by month of the year

013

2013

4013

6013

8013

10013

12013

14013

16013

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 DN

A13 Co

ncen

tra5

on13

Month13

Si13 M13 Kvale13 et13 al13 13 813

Figure S7 Mean initial call rate (CR1) by month of the year

992513

99313

993513

99413

994513

99513

995513

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 Ca

ll13 Ra

te13

Month13

Si13 M13 Kvale13 et13 al13 13 913

Figure S8 Mean DNA concentration by saliva sample storage time in 100 day intervals

11513

12013

12513

13013

13513

14013

14513

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 DN

A13 Co

ncen

tra5

on13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1013

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100 day intervals

98913

9913

99113

99213

99313

99413

99513

99613

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 Ca

ll13 Ra

te13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1113

Figure S10 Mean DNA concentration by age of study subject

013 2013 4013 6013 8013 10013 12013 14013 16013 18013

Mean13 DN

A13 Co

ncen

tra5

on13

Age13

Si13 M13 Kvale13 et13 al13 13 1213

Figure S11 Mean initial call rate (CR1) by age of study subject

99313

993513

99413

994513

99513

995513 Mean13 Ca

ll13 Ra

te13

Age13

Si13 M13 Kvale13 et13 al13 13 1313

Figure S12 Mean initial call rate (CR1) by DNA concentration

993913

99413

994113

994213

994313

994413

994513

994613 Mean13 Ca

ll13 Ra

te13

DNA13 Concentra5on13

Si13 M13 Kvale13 et13 al13 13 1413

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster

13

13

Si13 M13 Kvale13 et13 al13 13 1513

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 015 instead of 010 The altered parameter avoids a cluster split and yields more accurate AA and AB clusters

13

Page 10: Genotyping Informatics and Quality Control for 100,000 ...Affymetrix Axiom Genotyping Solution. Because genotyping took place ov er a short 14-month period, creating a near-real-time

Johnson Foundation the Ellison Medical Foundation the Wayneand Gladys Valley Foundation and Kaiser PermanenteInformation about data access can be obtained at httpwwwncbinlmnihgovprojectsgapcgi-binstudycgistudy_id=phs000674v1p1 and httpsrpgehportalkaiserorg

Note added in proof See Lapham et al 2015 (pp 1061ndash1072) and Banda et al 2015 (pp 1285ndash1295) in this issuefor related works

Literature Cited

Affymetrix Axiom Analysis Guides 2015 httpwwwaffymetrixcomsupportdownloadsmanualsaxiom_genotyping_solution_analysis_guidepdf

Affymetrix Genotyping Protocol 2015 Axiom 20 Assay AutomatedWorkflow httpmediaaffymetrixcomsupportdownloadsmanualsaxiom_2_assay_auto_workflow_user_guidepdf

Affymetrix White Paper 2009 httpmediaaffymetrixcomsupporttechnicaldatasheetsaxiom_genotyping_solution_datasheetpdf

Chang C-C and C-J Lin 2011 LIBSVM a library for supportvector machines ACM Trans Intell Syst Technol 2 1ndash27

Enger S M S K Van den Eeden B Sternfeld R K Loo C PQuesenberry Jr et al 2006 California Menrsquos Health Study(CMHS) a multiethnic cohort in a managed care setting BMCPublic Health 6 172

Hoffmann T J M N Kvale S E Hesselson Y Zhan C Aquinoet al 2011a Next generation genome-wide association tooldesign and coverage of a high-throughput European-optimizedSNP array Genomics 98 79ndash89

Hoffmann T J Y Zhan M N Kvale S E Hesselson J Gollubet al 2011b Design and coverage of high throughput genotyp-ing arrays optimized for individuals of East Asian African Amer-ican and Latino raceethnicity using imputation and a novelhybrid SNP selection algorithm Genomics 98 422ndash430

Communicating editor N R Wray

1060 M N Kvale et al

GENETICSSupporting Information

wwwgeneticsorglookupsuppldoi101534genetics115178905-DC1

Genotyping Informatics and Quality Control for100000 Subjects in the Genetic Epidemiology

Research on Adult Health and Aging (GERA) CohortMark N Kvale Stephanie Hesselson Thomas J Hoffmann Yang Cao David Chan

Sheryl Connell Lisa A Croen Brad P Dispensa Jasmin Eshragh Andrea Finn Jeremy GollubCarlos Iribarren Eric Jorgenson Lawrence H Kushi Richard Lao Yontao Lu Dana Ludwig

Gurpreet K Mathauda William B McGuire Gangwu Mei Sunita Miles Michael MittmanMohini Patil Charles P Quesenberry Jr Dilrini Ranatunga Sarah Rowell Marianne Sadler

Lori C Sakoda Michael Shapero Ling Shen Tanu Shenoy David Smethurst Carol P SomkinStephen K Van Den Eeden Lawrence Walter Eunice Wan Teresa Webster Rachel A Whitmer

Simon Wong Chia Zau Yiping Zhan Catherine Schaefer Pui-Yan Kwok and Neil Risch

Copyright copy 2015 by the Genetics Society of AmericaDOI 101534genetics115178905

Si13 M13 Kvale13 et13 al13 13 213

Figure S1 Distribution of GERA saliva sample collection dates by month and year

013

200013

400013

600013

800013

1000013

1200013

1400013

1600013

Jul-shy‐0

813

Sep-shy‐0813

Nov-shy‐0813

Jan-shy‐0913

Mar-shy‐0913

May-shy‐0913

Jul-shy‐0

913

Sep-shy‐0913

Nov-shy‐0913

Jan-shy‐1013

Mar-shy‐1013

May-shy‐1013

Jul-shy‐1

013

Sep-shy‐1013

Nov-shy‐1013

Jan-shy‐1113

Num

ber13

Month13 and13 Year13

Si13 M13 Kvale13 et13 al13 13 313

Figure S2 Distribution of saliva sample storage times prior to processing in months

013

500013

1000013

1500013

2000013

2500013

3000013

3500013

113 313 513 713 913 1113 1313 1513 1713 1913 2113 2313 2513 2713 2913 3113

Num

ber13

Months13

Si13 M13 Kvale13 et13 al13 13 413

Figure S3 A montage of digital array images for an early Axiom plate assay Note the problem in the upper right due to an assay problem Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays

13

Si13 M13 Kvale13 et13 al13 13 513

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample 13

13

13

Si13 M13 Kvale13 et13 al13 13 613

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset The chosen variance ratio threshold of 31 is shown in red

Si13 M13 Kvale13 et13 al13 13 713

Figure S6 Mean DNA concentration by month of the year

013

2013

4013

6013

8013

10013

12013

14013

16013

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 DN

A13 Co

ncen

tra5

on13

Month13

Si13 M13 Kvale13 et13 al13 13 813

Figure S7 Mean initial call rate (CR1) by month of the year

992513

99313

993513

99413

994513

99513

995513

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 Ca

ll13 Ra

te13

Month13

Si13 M13 Kvale13 et13 al13 13 913

Figure S8 Mean DNA concentration by saliva sample storage time in 100 day intervals

11513

12013

12513

13013

13513

14013

14513

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 DN

A13 Co

ncen

tra5

on13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1013

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100 day intervals

98913

9913

99113

99213

99313

99413

99513

99613

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 Ca

ll13 Ra

te13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1113

Figure S10 Mean DNA concentration by age of study subject

013 2013 4013 6013 8013 10013 12013 14013 16013 18013

Mean13 DN

A13 Co

ncen

tra5

on13

Age13

Si13 M13 Kvale13 et13 al13 13 1213

Figure S11 Mean initial call rate (CR1) by age of study subject

99313

993513

99413

994513

99513

995513 Mean13 Ca

ll13 Ra

te13

Age13

Si13 M13 Kvale13 et13 al13 13 1313

Figure S12 Mean initial call rate (CR1) by DNA concentration

993913

99413

994113

994213

994313

994413

994513

994613 Mean13 Ca

ll13 Ra

te13

DNA13 Concentra5on13

Si13 M13 Kvale13 et13 al13 13 1413

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster

13

13

Si13 M13 Kvale13 et13 al13 13 1513

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 015 instead of 010 The altered parameter avoids a cluster split and yields more accurate AA and AB clusters

13

Page 11: Genotyping Informatics and Quality Control for 100,000 ...Affymetrix Axiom Genotyping Solution. Because genotyping took place ov er a short 14-month period, creating a near-real-time

GENETICSSupporting Information

wwwgeneticsorglookupsuppldoi101534genetics115178905-DC1

Genotyping Informatics and Quality Control for100000 Subjects in the Genetic Epidemiology

Research on Adult Health and Aging (GERA) CohortMark N Kvale Stephanie Hesselson Thomas J Hoffmann Yang Cao David Chan

Sheryl Connell Lisa A Croen Brad P Dispensa Jasmin Eshragh Andrea Finn Jeremy GollubCarlos Iribarren Eric Jorgenson Lawrence H Kushi Richard Lao Yontao Lu Dana Ludwig

Gurpreet K Mathauda William B McGuire Gangwu Mei Sunita Miles Michael MittmanMohini Patil Charles P Quesenberry Jr Dilrini Ranatunga Sarah Rowell Marianne Sadler

Lori C Sakoda Michael Shapero Ling Shen Tanu Shenoy David Smethurst Carol P SomkinStephen K Van Den Eeden Lawrence Walter Eunice Wan Teresa Webster Rachel A Whitmer

Simon Wong Chia Zau Yiping Zhan Catherine Schaefer Pui-Yan Kwok and Neil Risch

Copyright copy 2015 by the Genetics Society of AmericaDOI 101534genetics115178905

Si13 M13 Kvale13 et13 al13 13 213

Figure S1 Distribution of GERA saliva sample collection dates by month and year

013

200013

400013

600013

800013

1000013

1200013

1400013

1600013

Jul-shy‐0

813

Sep-shy‐0813

Nov-shy‐0813

Jan-shy‐0913

Mar-shy‐0913

May-shy‐0913

Jul-shy‐0

913

Sep-shy‐0913

Nov-shy‐0913

Jan-shy‐1013

Mar-shy‐1013

May-shy‐1013

Jul-shy‐1

013

Sep-shy‐1013

Nov-shy‐1013

Jan-shy‐1113

Num

ber13

Month13 and13 Year13

Si13 M13 Kvale13 et13 al13 13 313

Figure S2 Distribution of saliva sample storage times prior to processing in months

013

500013

1000013

1500013

2000013

2500013

3000013

3500013

113 313 513 713 913 1113 1313 1513 1713 1913 2113 2313 2513 2713 2913 3113

Num

ber13

Months13

Si13 M13 Kvale13 et13 al13 13 413

Figure S3 A montage of digital array images for an early Axiom plate assay Note the problem in the upper right due to an assay problem Fast detection of such problems allowed us to fix them in a timely manner and avoid the problems in later plate assays

13

Si13 M13 Kvale13 et13 al13 13 513

Figure S4 The distribution of Axiom plate numbers (up to plate 730) of duplicate control sample versus original control sample 13

13

13

Si13 M13 Kvale13 et13 al13 13 613

Figure S5 A scatterplot of the natural log of the variance ratio (VR) versus mean allele frequency for each SNP in the EUR dataset The chosen variance ratio threshold of 31 is shown in red

Si13 M13 Kvale13 et13 al13 13 713

Figure S6 Mean DNA concentration by month of the year

013

2013

4013

6013

8013

10013

12013

14013

16013

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 DN

A13 Co

ncen

tra5

on13

Month13

Si13 M13 Kvale13 et13 al13 13 813

Figure S7 Mean initial call rate (CR1) by month of the year

992513

99313

993513

99413

994513

99513

995513

Jan13 Feb13 Mar13 Apr13 May13 Jun13 Jul13 Aug13 Sep13 Oct13 Nov13 Dec13

Mean13 Ca

ll13 Ra

te13

Month13

Si13 M13 Kvale13 et13 al13 13 913

Figure S8 Mean DNA concentration by saliva sample storage time in 100 day intervals

11513

12013

12513

13013

13513

14013

14513

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 DN

A13 Co

ncen

tra5

on13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1013

Figure S9 Mean initial call rate (CR1) by saliva sample storage time in 100 day intervals

98913

9913

99113

99213

99313

99413

99513

99613

lt25013 250-shy‐35013 350-shy‐45013 450-shy‐55013 550-shy‐65013 650-shy‐75013 750-shy‐85013

Mean13 Ca

ll13 Ra

te13

Days13 of13 Storage13

Si13 M13 Kvale13 et13 al13 13 1113

Figure S10 Mean DNA concentration by age of study subject

013 2013 4013 6013 8013 10013 12013 14013 16013 18013

Mean13 DN

A13 Co

ncen

tra5

on13

Age13

Si13 M13 Kvale13 et13 al13 13 1213

Figure S11 Mean initial call rate (CR1) by age of study subject

99313

993513

99413

994513

99513

995513 Mean13 Ca

ll13 Ra

te13

Age13

Si13 M13 Kvale13 et13 al13 13 1313

Figure S12 Mean initial call rate (CR1) by DNA concentration

993913

99413

994113

994213

994313

994413

994513

994613 Mean13 Ca

ll13 Ra

te13

DNA13 Concentra5on13

Si13 M13 Kvale13 et13 al13 13 1413

Figure S13 A normalized log intensity plot showing an example of a cluster split of a dominant AA cluster into incorrect AA and AB clusters The cluster split also forces the miscall of the AB cluster as an incorrect BB cluster

13

13

Si13 M13 Kvale13 et13 al13 13 1513

Figure S14 The same probe set as in Figure S13 but with the APT CSepPen parameter set at 015 instead of 010 The altered parameter avoids a cluster split and yields more accurate AA and AB clusters

13
