06 Genotyping and imputation 2019

Genetic variation Kaur Alasoo 7 March 2019



Topics for today

• What is genetic variation?

• How much genetic variation is there between individuals?

• What are important properties of genetic data (LD, HWE)?

• How do we measure genetic differences? How do we impute missing data?

• What can we do with genetic data?


Genotyping and imputation


How genotyping works

• https://www.youtube.com/watch?v=Naona1y_I2U&feature=youtu.be&t=59s


ANRV386-GG10-18 ARI 29 July 2009 15:32

SNP mapping arrays (which assay ∼10,000 SNPs) (94). The data were then used to study the genetic architecture of a variety of quantitative traits, ranging from body mass index (94) to fetal hemoglobin levels (106) to personality traits (101). Clearly, family-based genotype imputation will be maximally useful in samples that include very large numbers of related individuals. In these settings, genotypes for a relatively modest number of individuals can be propagated to many other additional individuals, thereby increasing power. However, imputing genotypes for known relatives of the individuals included in a GWAS of mostly unrelated individuals will always increase power (15) and should be considered whenever phenotyped relatives for the individuals to be genotyped in a scan are available. This will often be the case when individuals to be scanned are selected from a larger sample of related individuals that was previously phenotyped for linkage analyses or family-based association testing.

IMPUTATION IN SAMPLES OF UNRELATED INDIVIDUALS

Analyses of related individuals provide the intuition behind genotype imputation: Whenever a particular stretch of chromosome is examined in detail in at least one individual, we learn about the genotypes of many other individuals who inherit that same stretch IBD. When studying samples of apparently unrelated individuals, the same approach can be utilized. The major difference is that, when studying apparently unrelated individuals, shared haplotype stretches will be much shorter (because common ancestors are more distant) and thus may be harder to identify with confidence. The intuition that short stretches of haplotype provide useful information about untyped genetic markers provides the justification for the power gains suggested for many haplotype analysis strategies (22, 60, 91, 115).

The mechanics of genotype imputation in unrelated individuals are illustrated in Figure 2. Here, study samples genotyped for a relatively large number of genetic markers (perhaps 100,000–1,000,000) are compared to a reference panel of haplotypes that includes detailed information on a much larger number of markers (Figure 2a). To date, the HapMap Consortium database has typically served as the reference panel (104), but we expect that in the future larger sets of individuals characterized at larger numbers of markers will be available. Stretches of shared haplotype are then identified (Figure 2b) and missing genotypes for each study sample can be filled in by copying alleles observed in matching reference haplotypes (Figure 2c). In analyses of samples of European ancestry, comparisons with genotypes for the HapMap CEU panel typically yield shared haplotypes that range from about

[Figure 2: panels a, b and c each show a study sample (typed at a few sites only, e.g. ....A.......A....A... and ....G.......C....A...) aligned against the reference haplotypes; in panel c the untyped sites of the study sample are filled in.]

Figure 2. Genotype imputation in a sample of apparently unrelated individuals. (a) The observed data, which consists of genotypes at a modest number of genetic markers in each sample being studied and, in addition, of detailed information on genotypes (or haplotypes) for a reference sample. (b) The process of identifying regions of chromosome shared between a study sample and individuals in the reference panel. When a typical sample of European ancestry is compared to haplotypes in the HapMap reference panel, stretches of >100 kb in length are usually identified. With a larger reference panel, larger shared segments would be expected. (c) Observed genotypes and haplotype sharing information have been combined to fill in a series of unobserved genotypes in the study sample.
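As a rough illustration of the copying step described for Figure 2c, the sketch below fills the untyped sites of one study haplotype from the single best-matching reference haplotype. The function name, the toy sequences and the whole-window matching rule are assumptions made for illustration only; practical imputation methods treat each study haplotype as a mosaic of short reference segments and weigh many candidate matches rather than copying one haplotype end to end.

```python
def impute_by_copying(study, reference_panel, missing="."):
    """Fill untyped sites by copying from the best-matching reference haplotype.

    study           -- string with observed alleles and `missing` at untyped sites
    reference_panel -- list of fully typed haplotype strings of the same length
    """
    def n_matches(ref):
        # Count typed sites where the study allele equals the reference allele.
        return sum(1 for s, r in zip(study, ref) if s != missing and s == r)

    # Pick the reference haplotype that agrees with the most typed sites.
    best = max(reference_panel, key=n_matches)

    # Keep observed alleles; copy the reference allele everywhere else.
    return "".join(r if s == missing else s for s, r in zip(study, best))


# Hypothetical toy data (not taken from the figure).
reference_panel = ["ACGTAC", "ATGTCC", "ACGACC"]
study = ".C..A."
imputed = impute_by_copying(study, reference_panel)
print(imputed)
```

Copying from a single haplotype keeps the example short; it only makes sense over a window short enough that one shared stretch plausibly covers it.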


Genotype Imputation

Yun Li,1 Cristen Willer,1 Serena Sanna,2 and Goncalo Abecasis1

1Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109-2029; email: [email protected]; [email protected]
2Istituto di Neurogenetica e Neurofarmacologia, Consiglio Nazionale delle Ricerche, Cagliari, Italy

Annu. Rev. Genomics Hum. Genet. 2009. 10:387–406

The Annual Review of Genomics and Human Genetics is online at genom.annualreviews.org

This article's doi: 10.1146/annurev.genom.9.081307.164242

Copyright © 2009 by Annual Reviews. All rights reserved

1527-8204/09/0922-0387$20.00

Key Words

whole genome association study, resequencing, association study, HapMap, 1000 Genomes Project

Abstract

Genotype imputation is now an essential tool in the analysis of genome-wide association scans. This technique allows geneticists to accurately evaluate the evidence for association at genetic markers that are not directly genotyped. Genotype imputation is particularly useful for combining results across studies that rely on different genotyping platforms but also increases the power of individual scans. Here, we review the history and theoretical underpinnings of the technique. To illustrate performance of the approach, we summarize results from several gene mapping studies. Finally, we preview the role of genotype imputation in an era when whole genome resequencing is becoming increasingly common.


[Figure 3: LDLR locus and LDL cholesterol. Left y-axis: −log10 p-value for LDL (0–7); right y-axis: recombination rate in cM/Mb (0–100); x-axis: position on chromosome 19 (Mb), 10.9–11.2. Legend: Imputed, Affy 500k, Illu 317k. Genes in the bottom panel: CARM1, YIPF2, C19orf52, SMARCA4, LDLR, SPC24, KANK2, DOCK6, LOC55908. Annotation: P(DGI+SardiNIA) = 1.7 × 10^-6.]

Figure 3. Association of genetic variants near LDLR with LDL-cholesterol levels. We use data from the SardiNIA (94) and Diabetes Genetics Initiative (DGI, 90) studies reported by Willer et al. (111). Evidence for association at each SNP, measured as −log10 p-value, is represented along the y-axis. The placement of each SNP along the x-axis corresponds to assigned chromosomal location in the current genome build. Results for directly genotyped SNPs are colored in red, imputed SNPs in blue. Note that rs6511720, the SNP showing strongest association in the region, is not well tagged by any of the variants on the Affymetrix genotyping arrays used in the SardiNIA and DGI studies. Evidence for association at the SNP increases to p < 10^-25 after follow-up in >10,000 individuals in whom the SNP was genotyped directly (111). Association results are superimposed on a gray line that summarizes the local recombination rate map. The bottom panel indicates coding sequences in the region. The putative functional gene, LDLR, is highlighted in blue.

blood low density lipoprotein (LDL) cholesterol levels (Figure 3). The association signal was missed in an initial analysis that considered only genotyped SNPs because rs6511720 is not included in the Affymetrix arrays used to scan the genome in the majority of their samples and is only poorly tagged by individual SNPs on the chip (the best single marker tag is rs12052058 with pairwise r^2 of only 0.21). Another example we have encountered concerns the genome-wide association analysis of G6PD activity levels in a sample of Sardinian individuals (77, 94). There, analysis of directly genotyped SNPs revealed two sets of SNPs strongly associated (p < 5 × 10^-8) with G6PD activity levels, one near the G6PD gene locus on chromosome X and another near the HBB locus on chromosome 11. Genotype imputation revealed a strong additional signal (also with p < 5 × 10^-8) upstream of the 6PGD locus on chromosome 1 (M. Uda, S. Sanna & D. Schlessinger, personal communication) (Figure 4). The
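The pairwise r^2 quoted for the best tag SNP is the standard linkage disequilibrium measure between two biallelic sites. A minimal sketch of that calculation from phased haplotypes follows; the function name and the toy 0/1 allele vectors are assumptions for illustration and are not data from the studies above.

```python
def ld_r2(site_a, site_b):
    """r^2 between two biallelic sites.

    site_a, site_b -- equal-length lists of 0/1 alleles, one entry per
                      haplotype, in the same haplotype order.
    """
    n = len(site_a)
    p_a = sum(site_a) / n                                     # freq of allele 1 at site A
    p_b = sum(site_b) / n                                     # freq of allele 1 at site B
    p_ab = sum(a & b for a, b in zip(site_a, site_b)) / n     # freq of the 1-1 haplotype
    d = p_ab - p_a * p_b                                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


# Hypothetical haplotypes: a perfect tag and an uninformative one.
causal = [0, 0, 1, 1]
perfect_tag = [0, 0, 1, 1]
poor_tag = [0, 1, 0, 1]
print(ld_r2(causal, perfect_tag), ld_r2(causal, poor_tag))
```

A tag with low r^2, such as the 0.21 reported above, carries only a fraction of the association signal of the untyped variant, which is why the signal was visible only after imputation.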


[Figure 5: proportion of genome covered (%), 0–100, on the y-axis against r^2 between imputed and actual genotypes, 1.0 down to 0.2, on the x-axis; one curve each for reference panels of N = 60, N = 200 and N = 500 individuals.]

Figure 5. Genome coverage as a function of reference panel size. The accuracy of imputation increases with the number of individuals in the reference panel. To generate the figure, we analyzed genotyped data from the FUSION study (93). For any given r^2 threshold, the results illustrate the proportion of markers whose genotypes can be imputed with equal or greater accuracy. The results illustrate how the proportion of markers whose genotypes are recovered accurately (with high r^2 between imputed and actual genotypes) increases with larger reference panels.

for smaller ones (2, 37, 46–48, 65). Still, the most useful advance, in the context of genotype imputation-based analyses, would be the development of larger reference panels. As illustrated in Figure 5, the accuracy of genotype imputation-based analyses should increase substantially as the size of reference panels increases. This increase in accuracy occurs because haplotype stretches shared between study samples and samples in the reference panel increase in length and are easier to identify unambiguously with a larger reference panel.
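The coverage measure plotted in Figure 5 can be sketched in two steps: compute, per marker, the squared correlation between imputed and actual genotypes, then report the share of markers at or above a chosen r^2 threshold. The helper names and the numbers below are hypothetical; the original analysis pipeline is not described in this excerpt.

```python
def squared_correlation(imputed, actual):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(imputed)
    mean_i = sum(imputed) / n
    mean_a = sum(actual) / n
    cov = sum((x - mean_i) * (y - mean_a) for x, y in zip(imputed, actual))
    var_i = sum((x - mean_i) ** 2 for x in imputed)
    var_a = sum((y - mean_a) ** 2 for y in actual)
    if var_i == 0 or var_a == 0:
        raise ValueError("r^2 is undefined when either vector is constant")
    return cov * cov / (var_i * var_a)


def coverage(r2_per_marker, threshold):
    """Proportion of markers imputed with r^2 at or above the threshold."""
    return sum(r2 >= threshold for r2 in r2_per_marker) / len(r2_per_marker)


# Hypothetical per-marker accuracies for one reference panel size.
r2_values = [0.95, 0.85, 0.40, 0.99]
print(coverage(r2_values, 0.8))

# Hypothetical dosages (0-2 scale) for one marker across four individuals.
print(squared_correlation([0.1, 0.9, 1.8, 1.1], [0, 1, 2, 1]))
```

Repeating the `coverage` call over a grid of thresholds for each panel size would trace out curves of the kind the figure describes.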

IMPUTATION AND GENOME-WIDE RESEQUENCING DATA

So far, we have focused our discussion on the analysis of genotype data. However, genome sequencing technologies are also improving extremely rapidly. Whereas the first two human whole genome assemblies took years to complete (49, 107), several additional genomes have been assembled just in the past 18 months (7, 57, 110). These advances in whole genome sequencing have resulted from the deployment of massive throughput sequencing technologies, which differ from standard Sanger-based sequencing (88) in many ways. For example, the data produced by these new technologies typically have somewhat higher error rates (on the order of 1% per base). Since these technologies produce very large amounts of data, one typically accommodates these error rates by resequencing every site of interest many times to achieve a high-quality consensus.

We expect that the continued deployment of these technologies will significantly change how genotype imputation is used. An example is given by the 1000 Genomes Project (see http://www.1000genomes.org), which aims to deliver whole genome sequences for >1000 individuals from several different populations in the next 12 months. To do this in a cost-effective manner, the project is using a strategy that combines massively parallel shotgun sequencing technology with the same statistical machinery used to drive genotype imputation-based analyses. Specifically, a relatively modest amount of shotgun sequence data is being collected for each individual: Each of the target bases will be resequenced only 2–4x on average (statistical fluctuations around this average mean that many bases will not be covered even once), rather than the 20–40x used in conventional applications of these technologies to whole genome resequencing. To accurately call polymorphisms in each genome, the Project will then use imputation-based techniques to combine information across individuals who share a particular haplotype stretch. Using simulations, we have predicted that when 400 diploid individuals are sequenced at only 2x depth (1x per haploid genome) and the data are analyzed using approaches that combine data across individuals sharing similar haplotype stretches, polymorphic sites with a frequency of >2% can be genotyped with >99.5% accuracy (Y. Li & G. Abecasis, unpublished data). Note that the same 2x average depth would not be useful for genotype calling when examining a single individual, since, by chance, ∼37% of alleles would not be sampled. For another example of how genotype imputation can be combined with sequence data, see Reference 72.
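The ∼37% figure can be reproduced if the number of reads covering one allele is modelled as Poisson with mean equal to the per-haploid depth; that model is an assumption of this sketch, since the text does not name one. With mean depth 1 the probability of zero reads is e^-1 ≈ 0.368.

```python
import math


def prob_not_sampled(mean_depth):
    """Poisson probability that a site receives zero reads at the given mean depth."""
    return math.exp(-mean_depth)


# 2x diploid depth corresponds to 1x per haploid genome.
print(round(prob_not_sampled(1.0), 3))

# Share of bases expected to have no reads at all at 2x and 4x total depth.
print(round(prob_not_sampled(2.0), 3), round(prob_not_sampled(4.0), 3))
```

Under the same assumption, roughly 14% of bases would be uncovered at 2x and about 2% at 4x, consistent with the remark that many bases will not be covered even once.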


What can we do with genetic data?


The direction of the PC1 axis and its relative strength may reflect a special role for this geographic axis in the demographic history of Europeans (as first suggested in ref. 10). PC1 aligns north-northwest/south-southeast (NNW/SSE, −16 degrees) and accounts for approximately twice the amount of variation as PC2 (0.30% versus 0.15%, first eigenvalue = 4.09, second eigenvalue = 2.04). However, caution is required because the direction and relative strength of the PC axes are affected by factors such as the spatial distribution of samples (results not shown, also see ref. 9). More robust evidence for the importance of a roughly NNW/SSE axis in Europe is that, in these same data, haplotype diversity decreases from south to north (A.A. et al., submitted). As the fine-scale spatial structure evident in Fig. 1 suggests, European DNA samples can be very informative about the geographical origins of their donors. Using a multiple-regression-based assignment approach, one can place 50% of individuals within 310 km of their reported origin and 90% within 700 km of their origin (Fig. 2 and Supplementary Table 4, results based on populations with n > 6). Across all populations, 50% of individuals are placed within 540 km of their reported origin, and 90% of individuals within 840 km (Supplementary Fig. 3 and Supplementary Table 4). These numbers exclude individuals who reported mixed grandparental ancestry, who are typically assigned to locations between those expected from their grandparental origins (results not shown). Note that distances of assignments from reported origin may be reduced if finer-scale information on origin were available for each individual.

Population structure poses a well-recognized challenge for disease-association studies (for example, refs 11–13). The results obtained here reinforce that the geographic distribution of a sample is important to consider when evaluating genome-wide association studies
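To make the principal component summary concrete, here is a small standard-library sketch that extracts PC1 scores from a genotype matrix by power iteration. The toy genotypes, the function name and the choice of power iteration are all assumptions for illustration; the analysis quoted above used far larger data and dedicated software, and this sketch omits common steps such as per-SNP variance scaling.

```python
import math


def first_pc_scores(genotypes, iterations=100):
    """PC1 score per individual from a matrix of 0/1/2 genotypes (rows = individuals)."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre each SNP (column) on its mean genotype.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Individual-by-individual similarity (Gram) matrix.
    gram = [[sum(a * b for a, b in zip(ri, rj)) / m for rj in centred] for ri in centred]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    vec = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            raise ValueError("no variation left after centring")
        vec = [x / norm for x in new]
    return vec


# Hypothetical data: three individuals from each of two differentiated populations.
genotypes = [
    [0, 0, 2, 2, 1], [0, 1, 2, 2, 1], [0, 0, 2, 1, 1],   # population A
    [2, 2, 0, 0, 1], [2, 1, 0, 0, 1], [2, 2, 0, 1, 1],   # population B
]
pc1 = first_pc_scores(genotypes)
print([round(x, 2) for x in pc1])
```

In this toy case the sign of the PC1 score separates the two populations, which is the property that lets PC axes serve as ancestry covariates in association studies. The overall sign of a principal component is arbitrary.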

[Figure 1: panel a, individuals plotted on PC1 versus PC2, with the axes rotated so that they read as north–south and east–west in PC1–PC2 space; panel b, magnification around Switzerland (French-, German- and Italian-speaking Swiss alongside French, German and Italian samples); panel c, median genetic correlation against geographic distance between populations (km, 0–3,000). Country labels visible in the extract include Italy, Germany, France, UK, Spain and Portugal.]

Figure 1 | Population structure within Europe. a, A statistical summary of genetic data from 1,387 Europeans based on principal component axis one (PC1) and axis two (PC2). Small coloured labels represent individuals and large coloured points represent median PC1 and PC2 values for each country. The inset map provides a key to the labels. The PC axes are rotated to emphasize the similarity to the geographic map of Europe. AL, Albania; AT, Austria; BA, Bosnia-Herzegovina; BE, Belgium; BG, Bulgaria; CH, Switzerland; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NO, Norway; NL, Netherlands; PL, Poland; PT, Portugal; RO, Romania; RS, Serbia and Montenegro; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK, Slovakia; TR, Turkey; UA, Ukraine; YG, Yugoslavia. b, A magnification of the area around Switzerland from a showing differentiation within Switzerland by language. c, Genetic similarity versus geographic distance. Median genetic correlation between pairs of individuals as a function of geographic distance between their respective populations.

NATURE | Vol 456 | 6 November 2008 LETTERS

©2008 Macmillan Publishers Limited. All rights reserved

Novembre, John, et al. "Genes mirror geography within Europe." Nature 456.7218 (2008): 98.


coordinates of capitals or the city where an individual population sample had been recruited, respectively. The results overlapped with previous findings in that the first barrier was seen between Finland and all other samples, a second barrier separated Southern Italy from the remainder, a third was found between Western Russia, Poland and Lithuania on the one hand, and Bulgaria on the other, a fourth was seen between Kuusamo and Helsinki, and a fifth was between the Baltic region and Poland on the one hand, and Sweden on the other (Figure S4).

Inflation factor lambda (λ)

Table 2 lists the pair-wise inflation factor λ between studied samples. The inflation factor λ was calculated with the method of the Genomic Control [19]. We assumed λ to be constant across the genome and λ was estimated as the median of the observed chi-square statistics divided by the median of the central chi-square distribution with 1 degree of freedom (i.e. 0.456). This factor was found to range from unity (between the samples from the same country) to 4.21 (between Spain and the Kuusamo region). The overall average λ value was 1.82; in separate clusters it amounted to 1.23 (Baltic Region, Western Russia and Poland), 1.54 (Italy and Spain), 1.22 (Central and Western Europe), and 1.86 (Finland), respectively. The correlation coefficient between geographic distance and λ was r^2 = 0.386 (p-value ≪ 0.01). This value is probably an underestimate of the European-wide relationship due to the inclusion of the Kuusamo and Geneva samples. One is an isolate and the other is a highly heterogeneous international metropolis. The λ values between CEU and the other samples (Table 2) were smaller than those obtained using the Northern German sample as a reference, chosen as the nearest to the origin of CEU sample, and the correlation between geography and λ with CEU was only r^2 = 0.251 (p-value 0.017). Both results probably reflect the higher genetic variability in the CEU sample.

The high level of genetic homogeneity in Europe was again highlighted by the λ values calculated between the four HapMap samples (data not shown), which ranged from 21.56 (YRI vs JPT) via 13.27 (CEU vs CHB) to 1.77 between CHB and JPT. The λ value between the African and European samples was slightly smaller than that between the African and Asian samples.
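The λ estimate described above is a ratio of two medians, which can be sketched in a few lines. The function name and the example statistics are hypothetical. The null median is derived here from the normal quantile function, giving about 0.455, which the text quotes as 0.456.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom:
# if Z is standard normal, P(Z^2 <= m) = 0.5 means Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi2_stats):
    """Genomic-control lambda: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical 1-df chi-square statistics from a comparison of two samples.
observed = [0.05, 0.31, 0.83, 1.20, 2.75]
print(round(CHI2_1DF_MEDIAN, 4), round(inflation_factor(observed), 2))
```

A λ near 1 indicates statistics that behave as expected under the null, while larger values, such as those reported between distant populations, indicate systematic allele frequency differences across the genome.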

Marker-wise significance test

Marker-wise significance test was performed in order to assess the allelic distribution in pair-wise comparison of studied cohorts (CEU sample was not included) (Table 2). After applying

Figure 2. The European genetic structure (based on 273,464 SNPs). Three levels of structure as revealed by PC analysis are shown: A) inter-continental; B) intra-continental; and C) inside a single country (Estonia), where median values of the PC1&2 are shown. D) European map illustrating the origin of sample and population size. CEU - Utah residents with ancestry from Northern and Western Europe, CHB - Han Chinese from Beijing, JPT - Japanese from Tokyo, and YRI - Yoruba from Ibadan, Nigeria.
doi:10.1371/journal.pone.0005472.g002


PLoS ONE | www.plosone.org 5 May 2009 | Volume 4 | Issue 5 | e5472

Nelis, Mari, et al. "Genetic structure of Europeans: a view from the North–East." PloS one 4.5 (2009): e5472.


Assign RNA-seq samples to genotyped individuals

(similarly to Hoen et al., 2013): a ‘match’ appears as a point close to 100% concordance for both measures whereas all mismatches appear as a distant cluster of points (Fig. 1A). As described later, unexpected intermediate positions allow the user to detect sample cross-contaminations or PCR amplification bias. In addition, since MBV reports concordance metrics for all genotyped individuals, it can detect swapped and contaminating samples if included in the input VCF.

To test and characterize the various properties of this new tool, we used data for 21 1000-Genomes individuals (The 1000 Genomes Project Consortium, 2015) for which we have: (I) genotype data produced with whole genome sequencing, (II) gene expression data from RNA-seq (Waszak et al., 2015) and a tagging technology CAGE (Cap Analysis of Gene Expression, Takahashi et al., 2012, unpublished data from A.F.), (III) chromatin-binding profile for the second largest subunit of RNA polymerase II and the transcription factor PU.1 (Waszak et al., 2015) as well as (IV) genome wide distribution of 3 histone modifications (H3K4me1, H3K4me3 and H3K27ac; Waszak et al., 2015; Supplementary Materials). For all the 147 corresponding BAM files, mapped with BWA (Li and Durbin, 2009), we ran MBV and investigated the outcome as follows. First, we computed the number of variant sites as a function of the minimal sequencing coverage required and concluded that a minimal coverage of 10 reads provides enough variants for reasonable estimates of the two concordance measures across all molecular assays (Supplementary Fig. S1). We then produced 147 scatter plots (heterozygous versus homozygous concordance), one for each BAM file, showing a consistent pattern (Fig. 1B): a cluster of points corresponding to all genotyped individuals not matching to the sequence data (in red) and a point close to the 100% concordance (in green), which corresponds to a match between the genotype and the sequence data. Combining the scatter plots for the 21 individuals, we obtained two well-defined clusters, one for the matches and the other for the mismatches (Fig. 1C). Remarkably, the relative positions of both clusters remain stable regardless of the source of the genotypes (sequencing or imputation; Supplementary Methods S3; Supplementary Fig. S2) or the molecular phenotype assayed (Supplementary Fig. S3).

Next, we investigated how the position of additional samples relative to these two clusters can be informative of mislabeling or technical issues. First, a sample that is unexpectedly part of the mismatch cluster involves sample mislabeling; the genotype and sequence data belong to distinct individuals (Fig. 1C; unexpected data mismatch). Second, a sample that is unexpectedly part of the match cluster also implies sample mislabeling; the genotype and sequence data belong to the same individual but are incorrectly labeled with distinct IDs (Fig. 1C; unexpected data match). Finally, as shown by our simulations (Supplementary Methods S4–5), samples located in neither of the two clusters should be controlled for technical bias during sample or library preparation. We observed that (I) increasing cross-sample contamination leads to decreased homozygous concordance values with no change in heterozygous concordance (Fig. 1D, in accordance with Castel et al., 2015) while (II) increasing amplification bias leads to decreased heterozygous concordance with no change in homozygous concordance (Fig. 1E).
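The two concordance measures can be sketched as follows. The data structures, the function name and the exact matching rule (a heterozygous site counts as concordant when both genotype alleles, and only those, appear in the reads; a homozygous site when only the genotype allele appears) are simplifying assumptions for illustration and are not taken from the MBV implementation. Only the minimum coverage of 10 reads comes from the text.

```python
def concordance(genotypes, read_counts, min_coverage=10):
    """Return (heterozygous concordance, homozygous concordance) for one individual.

    genotypes   -- {site: (allele1, allele2)} from the genotype data
    read_counts -- {site: {allele: number_of_reads}} from the sequence data
    """
    het_total = het_ok = hom_total = hom_ok = 0
    for site, (a1, a2) in genotypes.items():
        counts = read_counts.get(site, {})
        if sum(counts.values()) < min_coverage:
            continue  # too few reads to judge this site
        observed = {allele for allele, n in counts.items() if n > 0}
        if a1 != a2:
            het_total += 1
            het_ok += int(observed == {a1, a2})   # both alleles seen, nothing else
        else:
            hom_total += 1
            hom_ok += int(observed == {a1})       # only the expected allele seen
    het = het_ok / het_total if het_total else float("nan")
    hom = hom_ok / hom_total if hom_total else float("nan")
    return het, hom


# Hypothetical sites and read counts.
genotypes = {"s1": ("A", "G"), "s2": ("C", "C"), "s3": ("T", "T"), "s4": ("A", "C")}
read_counts = {
    "s1": {"A": 6, "G": 5},    # het, both alleles seen
    "s2": {"C": 12},           # hom, consistent
    "s3": {"T": 9, "C": 3},    # hom, unexpected second allele
    "s4": {"A": 2, "C": 1},    # below the coverage threshold, skipped
}
print(concordance(genotypes, read_counts))
```

Computing this pair for every individual in the genotype file against one sequence file gives the points of a scatter plot like those described above: the true donor should sit near (1, 1) and everyone else in a separate cluster. A rule this strict would be sensitive to sequencing errors in real data.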

[Fig. 1: six panels, A to F.]

Fig. 1. (A) Schematic representation of the MBV method. (B–E) Scatter plots of the concordance at heterozygous genotypes (X-axis) versus concordance at homozygous genotypes (Y-axis). Results for a typical CAGE BAM file (individual GM12814), ‘X’ indicating the 100% concordance point, are shown on (B). Aggregated results for 21 RNA-seq BAM files are shown on (C). Mislabeling scenarios are indicated with black arrows. (D) The results for simulation of cross-samples contamination (blue) with a known contaminant (yellow). Percentage of contamination is indicated. (E) The results for simulated amplification bias in sequence data. (F) The estimated running time to process 1000 individuals across multiple molecular assays (Supplementary Methods S6)

1896 A.Fort et al.

Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/12/1895/2982050 by ELNET Group Account user on 08 November 2017