Controlling the false discovery rate in GWAS with population structure

Matteo Sesia

Department of Data Sciences and Operations, University of Southern California

Stephen Bates

Department of Statistics, Stanford University

Emmanuel Candès

Departments of Statistics and of Mathematics, Stanford University

Jonathan Marchini

Genetics Center, Regeneron Pharmaceuticals

Chiara Sabatti

Departments of Statistics and of Biomedical Data Sciences, Stanford University

Abstract

This paper proposes a novel statistical method to address population structure in genome-wide association studies while controlling the false discovery rate, which overcomes some limitations of existing approaches. Our solution accounts for linkage disequilibrium and diverse ancestries by combining conditional testing via knockoffs with hidden Markov models from state-of-the-art phasing methods. Furthermore, we account for familial relatedness by describing the joint distribution of haplotypes sharing long identical-by-descent segments with a generalized hidden Markov model. Extensive simulations affirm the validity of this method, while applications to UK Biobank phenotypes yield many more discoveries compared to BOLT-LMM, most of which are confirmed by the Japan Biobank and FinnGen data.

INTRODUCTION

Genome-wide association studies (GWAS) measure hundreds of thousands of single-nucleotide

polymorphisms (SNPs) in thousands of individuals to identify variants affecting a phenotype of interest.

The objective is to reliably determine which genotype-phenotype associations are likely to be

important and which are spurious; several challenges make this difficult, particularly for polygenic

traits which may be influenced by thousands of variants. First, spurious associations may arise

from multiple comparisons; i.e., strong correlations will occur by chance as numerous variables are

tested simultaneously.1 The typical solution is to apply a stringent significance threshold designed

to control the family-wise error rate (FWER), i.e., the probability of making at least one false positive. However,

this is too conservative for polygenic phenotypes because numerous discoveries are expected.2–5

Indeed, SNPs with effect sizes large enough to pass the FWER threshold do not fully explain the

heritability of complex traits.6 A second source of spurious associations is linkage disequilibrium

(LD): the stochastic dependence of alleles on the same chromosome,7,8 which tends to be stronger

for those physically closer to each other. Thus, SNPs with no effect on the phenotype may be

marginally associated with it simply because they are in LD with a causal variant.5,9 Standard

GWAS methods cannot account for LD,5 so their findings do not precisely localize causal variants

and may be difficult to recognize as distinct when multiple nearby loci are discovered.9 Such ambiguity

may help explain why large studies are reporting associations densely across the genome.10

Lastly, population structure (heterogeneous degrees of similarity between different individuals, due

to diverse ancestries or familial relatedness) may analogously lead to spurious associations and

has long been of concern in genetic analyses.11–13 Population structure not only induces dependence

between distant loci, it can even create spurious associations when there are no causal variants at

all. For example, if two populations differ in the distribution of the trait solely due to their environments,

then any SNP whose allele frequency varies across populations will be associated with

the trait. Hence, several methods were developed to account for population structure. An early

remedy built on principal component analysis (PCA),14 although linear mixed models (LMM) have

subsequently become predominant.15–17 While these corrections mitigate the impact of population

structure, they are limited to marginal testing (i.e., they ignore LD) with FWER control.
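The difference between FWER and FDR control discussed above can be illustrated with a minimal sketch. It contrasts the textbook Bonferroni correction with the Benjamini-Hochberg procedure on made-up p-values. This is only an illustration of the two error criteria: KnockoffZoom controls the FDR through knockoffs rather than through Benjamini-Hochberg, and the function names and p-values below are hypothetical.

```python
def bonferroni(pvals, alpha=0.1):
    """Reject H_j when p_j <= alpha / m; controls the FWER at level alpha."""
    m = len(pvals)
    return [j for j, p in enumerate(pvals) if p <= alpha / m]


def benjamini_hochberg(pvals, q=0.1):
    """Reject the k smallest p-values, with k = max{i : p_(i) <= i * q / m}.

    Controls the FDR at level q for independent p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * q / m:
            k = rank
    return sorted(order[:k])


# Made-up p-values for ten hypotheses (illustration only).
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]
fwer_hits = bonferroni(pvals, alpha=0.1)        # indices passing alpha / m
fdr_hits = benjamini_hochberg(pvals, q=0.1)     # indices passing the BH rule
print(fwer_hits, fdr_hits)
```

On these toy inputs the FDR-controlling rule rejects more hypotheses than the FWER-controlling one, which is the motivation given in the text for polygenic traits.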

KnockoffZoom9 was recently proposed to address the limitations of marginal testing and FWER

control. This method accounts for LD through conditional (on nearby loci) rather than marginal

testing, so that its discoveries are clearly distinct and indicative of the presence of causal variants.5,9

The inferences are valid for both quantitative (e.g., body measurement) and qualitative (e.g., disease

status) phenotypes, regardless of their genetic architecture, in contrast to traditional methods,

which rely on linear models whose correctness may be difficult to verify. Practically, KnockoffZoom

partitions the genome into disjoint blocks and tests a conditional association hypothesis for each of

them, which, if rejected, suggests the presence of causal effects.9 While this requires testing the importance

of fixed groups of SNPs (determined at will, but before looking at the phenotype), it can

be easily applied at increasing levels of resolution, with finer and finer genome partitions, in order to

localize causal variants as precisely as possible. KnockoffZoom does not analyze imputed variants18

because, without additional assumptions, it is theoretically impossible to test conditional associations

beyond the resolution of the SNP array (imputation cannot introduce additional information

about the phenotype compared to that contained in the typed SNPs).9 Nonetheless, its findings

achieve resolution comparable to that of fine-mapping methods19–21 applied to typed variants.9

Furthermore, KnockoffZoom is more powerful than LMM approaches because it controls the false

discovery rate22 (FDR)—the expected proportion of false discoveries—instead of the FWER. The

main limitation of KnockoffZoom is that, as originally implemented, it is only applicable to relatively

homogeneous samples because it does not fully account for population structure. To address

this missing component, we now present an extension that accounts for population structure while

retaining the advantages of conditional testing and FDR control.

Before presenting our results, we review the mechanics of KnockoffZoom. The method establishes

statistical significance through careful data augmentation: it constructs imperfect copies

(knockoffs)5,23 of the genotypes for each individual that are in LD with the real variants and

also have the same allele frequencies, in such a way that replacing a group of genotypes with the

corresponding knockoffs would keep the modified data set statistically indistinguishable from the

original one, except possibly for some reduced association with the phenotype. Knockoffs serve as

negative controls:23,24 they are blindly analyzed jointly with the genotypes (the algorithm ignores

which variables are knockoffs until the very end), through any procedure of choice (e.g., an LMM, a

sparse regression model, or any machine learning tool). The significance of each genetic segment is

determined by contrasting the estimated importance of its genotypes to that of the corresponding

knockoffs. This is a fair comparison because knockoffs behave as the real non-causal variants. In

order to define precisely, and achieve practically, this exchangeability, the distribution of genotypes

is approximated with a hidden Markov model (HMM).5 This assumption is well grounded:

HMMs have already been widely and successfully employed to describe LD,8 for the purposes either

of ancestry inference,25,26 of identifying population-specific haplotype blocks,27 or of phasing and

imputation.28–32
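The contrast between genotypes and knockoffs described above can be sketched with the standard knockoff filter. Each group receives a statistic W equal to the importance of its genotypes minus that of its knockoffs, and groups are selected when W exceeds a data-dependent threshold. The importance values below are made up and the function names are hypothetical. This is a minimal sketch of the generic knockoff+ selection rule, not the KnockoffZoom software.

```python
def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t in {|W_j| > 0} such that
    (offset + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    Returns infinity if no such t exists (no discoveries)."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        neg = sum(1 for w in W if w <= -t)
        pos = sum(1 for w in W if w >= t)
        if (offset + neg) / max(1, pos) <= q:
            return t
    return float("inf")


def knockoff_select(W, q=0.1):
    """Indices of groups whose statistic is at least the knockoff threshold."""
    t = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]


# Made-up importance measures for 8 groups (e.g., absolute lasso coefficients).
real = [5.0, 4.0, 3.5, 3.0, 0.0, 0.5, 0.0, 0.25]
knock = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.25, 0.0]
W = [r - k for r, k in zip(real, knock)]

# A large nominal level is used only because the toy example has 8 groups.
selected = knockoff_select(W, q=0.25)
print(selected)
```

Negative statistics play the role of negative controls: for null groups, W is equally likely to be positive or negative, so the count of large negative values estimates the number of false positives among the large positive ones.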

However, KnockoffZoom has so far relied practically on the fastPHASE29 HMM, which is not

designed to account for population structure. In fact, this model cannot simultaneously describe

association between different loci due to mixture33 (individuals with different ancestries), admixture34

(individuals of mixed ancestry), and linkage (allele dependencies between nearby loci); we

demonstrate this limitation empirically in Supplementary Notes A–C and Supplementary Figures

1–13. Therefore, KnockoffZoom was previously applied only to homogeneous and unrelated

samples.5,9 In this paper, we build upon the more flexible SHAPEIT35–37 HMM and expand the

applicability of KnockoffZoom to populations with diverse and possibly admixed ancestries. We

also extend this method to account for familial relatedness, by jointly describing the distribution

of haplotypes from multiple individuals sharing long identical-by-descent (IBD) genetic

segments,38 and then developing a corresponding construction of knockoffs. Crucially, our method

is applicable even if the population structure and the familial relatedness are cryptic, and remains

completely model-free regarding the genetic architecture of the trait.

After testing our method through careful simulations with real genotypes and simulated phenotypes,

we shall apply it to study height, body mass index, platelet count, systolic blood pressure,

cardiovascular disease, respiratory disease, hypothyroidism, and diabetes in the UK Biobank data

set,39 including virtually all samples therein. We will show this yields more numerous (between 25%,

for height, and 320%, for cardiovascular disease) distinct discoveries compared to BOLT-LMM.40

Furthermore, we will verify that most additional discoveries (between 37.3%, for platelet count,

and 88.5%, for diabetes) are validated by the GWAS Catalog,52 the Japan Biobank Project,41

or the FinnGen resource,42 while many of the remaining ones have known associations to related

traits. Finally, we highlight a novel discovery for cardiovascular disease, as one of many promising

findings that may be worthy of further validation. Our full results, which involve thousands

of discoveries, are available from https://msesia.github.io/knockoffzoom-v2/, along with an

efficient software implementation of our method.

RESULTS

Knockoffs preserving population structure

We assume all haplotypes have been phased9 and we approximate their distribution with an

HMM similar to that of SHAPEIT.35–37 This model describes each haplotype sequence as a mosaic

of K reference motifs corresponding to the haplotypes of other individuals in the data set, where

K is fixed (e.g., K = 100); critically, different haplotypes may use different sets of motifs. The

references are chosen based on haplotype similarity; see Methods. The idea is that the ancestry of

an individual should be approximately reflected by the choice of references; e.g., the haplotypes of

someone from England should be well-approximated by a mosaic of haplotypes primarily belonging

to other English individuals. Conditional on the references, the identity of the motif copied at each

position is described by a Markov chain with transition probabilities proportional to the genetic

distances between neighboring sites; different chromosomes are treated as independent. Conditional

on the Markov chain, the motifs are copied imperfectly: relatively rare mutations can independently

occur at any site. This model effectively accounts for population structure as well as LD in the

context of phasing,35–37 and our simulations will demonstrate its usefulness for conditional testing.
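The mosaic model described above can be sketched as a small generative simulation: a Markov chain picks which reference motif is copied at each site, and the copied allele is occasionally flipped by a mutation. The exact parameterization used by the method is given in the Methods and is not reproduced here. The switch probability 1 - exp(-rho * d), the parameters `rho` and `mut`, and the function name are assumptions made for illustration.

```python
import math
import random


def sample_mosaic(refs, gen_dist, rho=1.0, mut=0.01, rng=None):
    """Sample one haplotype as an imperfect mosaic of K reference motifs.

    refs:     list of K reference haplotypes (lists of 0/1 alleles, length p)
    gen_dist: list of p - 1 genetic distances between neighboring sites
    rho, mut: hypothetical recombination and mutation parameters
    Returns (path, hap): the motif copied at each site and the sampled alleles.
    """
    rng = rng or random.Random()
    K, p = len(refs), len(refs[0])
    state = rng.randrange(K)
    path, hap = [], []
    for j in range(p):
        if j > 0:
            # Assumed form: a switch becomes more likely as genetic distance grows.
            if rng.random() < 1.0 - math.exp(-rho * gen_dist[j - 1]):
                state = rng.randrange(K)
        allele = refs[state][j]
        if rng.random() < mut:
            allele = 1 - allele  # rare, independent mutation at this site
        path.append(state)
        hap.append(allele)
    return path, hap


# Toy usage with K = 3 made-up motifs over 6 sites.
refs = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1]]
path, hap = sample_mosaic(refs, [0.1] * 5, rho=2.0, mut=0.01, rng=random.Random(0))
print(path, hap)
```

In the method itself the set of references differs across haplotypes and is chosen by similarity; in this sketch `refs` is simply passed in by the caller.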

Having defined an HMM for each haplotype sequence, knockoffs are generated by repurposing

the algorithm in KnockoffZoom v1,9 which was originally based on the fastPHASE model, but is

sufficiently general to apply here. Our software implementation takes as input phased haplotypes

in standard binary format, builds the data-adaptive SHAPEIT HMM, and returns a knockoff-augmented

data set that can be conveniently used by KnockoffZoom for conditional testing at the

desired resolution. The knockoff generation procedure is explained in the Methods.

Knockoffs preserving familial relatedness

The above model is not directly applicable to closely related samples because it describes them

independently (conditional on the reference motifs), whereas these share long IBD segments.38,43

Recall that knockoffs must follow the genotype distribution; therefore, processing related samples

independently would yield knockoffs breaking the relatedness structure, invalidating our inferences

(see Methods for a full explanation). We fix this by jointly modeling haplotypes in the same family.

We begin by detecting long IBD segments in the data.44–47 If the pedigree is known in advance,

one can restrict the IBD search within the given families; otherwise, there exists efficient software

to approximately reconstruct families from data at the UK Biobank scale.48 The results thus

obtained define a relatedness graph, where two haplotype sequences are connected if they share an

IBD segment; we refer to the connected components of this graph as the IBD-sharing families.
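The relatedness graph described above can be sketched with a union-find structure: haplotypes are nodes, each detected IBD segment adds an edge, and the connected components are the IBD-sharing families. The input format (tuples of two haplotype indices plus segment endpoints) and the function name are hypothetical; real IBD detection output would first need to be converted to this form.

```python
def ibd_families(n_haplotypes, ibd_segments):
    """Group haplotypes into connected components of the IBD-sharing graph.

    ibd_segments: iterable of (hap_a, hap_b, start, end) tuples (assumed format);
    only the pair of haplotype indices matters for building the graph.
    """
    parent = list(range(n_haplotypes))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _start, _end in ibd_segments:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    families = {}
    for h in range(n_haplotypes):
        families.setdefault(find(h), []).append(h)
    return sorted(families.values())


# Toy usage: haplotypes 0-1-2 are linked by two segments, 4-5 by one, 3 is isolated.
segments = [(0, 1, 100, 900), (1, 2, 300, 500), (4, 5, 0, 50)]
families = ibd_families(6, segments)
print(families)
```

Note that haplotypes 0 and 2 end up in the same family even though they share no segment directly, because both share a segment with haplotype 1.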

Conditional on the location of the IBD segments, we define a larger HMM jointly describing the

distribution of all haplotypes in each IBD-sharing family. Marginally, each haplotype is modeled by

the SHAPEIT HMM; however, different haplotypes are coupled along the IBD segments (and we

avoid using haplotypes in the same family as references for one another). This model is explicitly

described in the Methods, where we explain how to generate knockoffs for it. The details are

technical, since the algorithm in Sesia et al.9 is no longer feasible due to the much larger state

space of this new HMM. We shall demonstrate empirically that our method generates knockoffs

sharing the same IBD segments as the real haplotypes, and also simultaneously preserves LD.

Numerical experiments with genetic data

Setup

We test the proposed methods via simulations based on subsets of phased haplotypes from the

UK Biobank data, chosen so as to have strong population structure due to either diverse ancestries

or familial relatedness. After some pre-processing (see Methods), we partition each of the 22

autosomes into contiguous groups of SNPs at 7 different levels of resolution, ranging from that

of single SNPs to that of 425 kb-wide groups; see Supplementary Table 1. These partitions are

obtained by applying complete-linkage hierarchical clustering to the SNPs (genetic distances are

used as similarity measures) and cutting the resulting dendrogram at different heights. For each

such partition, we generate knockoffs with two alternative methods. In one case, we fit fastPHASE

using K = 50 latent HMM motifs, even though this is not designed for data with population

structure, and then apply KnockoffZoom v1.9 In the other case, we apply our new method to

generate knockoffs, and then we proceed from there to select important groups of SNPs as in

KnockoffZoom v1; we refer to this approach as KnockoffZoom v2.
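The nested genome partitions described above can be sketched as follows. For points on a line, the complete-linkage distance between two adjacent groups equals the span of their union, so cutting the dendrogram at a given height amounts to merging adjacent groups while the merged span stays within that height. The sketch uses made-up integer positions rather than genetic distances, and the function name is hypothetical. It is a simplified stand-in for a general hierarchical clustering routine.

```python
def contiguous_groups(positions, max_width):
    """Partition sorted positions into contiguous groups by greedy
    complete-linkage merging, stopping once the cheapest merge would
    produce a group wider than max_width."""
    groups = [[x] for x in positions]
    while len(groups) > 1:
        # Span of the union of each pair of adjacent groups.
        spans = [groups[i + 1][-1] - groups[i][0] for i in range(len(groups) - 1)]
        i = min(range(len(spans)), key=spans.__getitem__)
        if spans[i] > max_width:
            break
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


# Made-up SNP positions; each cut height gives one level of resolution.
positions = [0, 10, 20, 100, 110, 300]
for height in (0, 25, 150):
    print(height, contiguous_groups(positions, height))
```

Increasing the cut height only merges existing groups, so the partitions at different resolutions are nested, with the finest one consisting of single SNPs.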

Knockoffs preserving population structure

We generate knockoffs for 10,000 unrelated individuals with one of 6 different self-reported

ancestries (Supplementary Table 2). We will perform a diagnostic check of these knockoffs by

verifying their exchangeability with the genotypes through a PCA and a covariance analysis; then,

we will assess the performance of KnockoffZoom for simulated phenotypes.

A property of valid knockoffs is that their distribution is the same as that of the genotypes.5,23

(This follows from the stronger exchangeability requirement; see Methods.) Thus, the proportion of

knockoff variance explained by their top principal components should be close to the corresponding

quantity computed on the genotypes. The PCA in Supplementary Figures 14–15 demonstrates that

our knockoffs preserve population structure quite accurately in this sense, unlike those based on the

fastPHASE HMM. This holds even at low resolution, where the sensitivity to model misspecification

is highest.9 The covariance analysis9 in Supplementary Figures 16–22 confirms that our knockoffs

are approximately exchangeable with the genotypes and should have power comparable to that of

knockoffs based on the fastPHASE HMM.

Next, we simulate continuous phenotypes conditional on the true genotypes, from a homoscedastic

linear model with 500 causal variants distributed uniformly across the genome; the total heritability

is varied as a control parameter.9 We apply KnockoffZoom v1 and v2 on these data, using

lasso-based statistics.49 Supplementary Figure 23 shows the histogram of test statistics, which

should be symmetric around zero for null groups (i.e., those without causal variants). The statistics

obtained with the new knockoffs satisfy this property, while the fastPHASE model leads to a

rightward bias, which may result in an excess of false positives. By contrast, both methods yield

similarly distributed statistics for causal groups. The power9 and FDR are compared in Figure 1:

KnockoffZoom v2 has slightly lower power, but always controls the FDR.
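The phenotype simulation described above can be sketched as a homoscedastic linear model whose noise variance is set from a target heritability h2, using var(genetic) / (var(genetic) + var(noise)) = h2. The equal effect sizes with random signs, the random toy genotypes, and the function name are assumptions made for illustration. The actual simulation design is specified in the Methods and in the cited reference.

```python
import random
import statistics


def simulate_phenotype(X, causal, h2, rng):
    """Simulate y = g + e with g = sum of signed causal genotypes.

    X:      list of n genotype rows (allele counts 0/1/2)
    causal: indices of causal variants
    h2:     target heritability, 0 < h2 <= 1
    Returns (y, g): phenotypes and their genetic components.
    """
    signs = {j: rng.choice((-1.0, 1.0)) for j in causal}  # assumed effect sizes
    g = [sum(signs[j] * row[j] for j in causal) for row in X]
    var_g = statistics.pvariance(g)
    # Noise standard deviation solving var_g / (var_g + sigma^2) = h2.
    sigma = (var_g * (1.0 - h2) / h2) ** 0.5
    y = [gi + rng.gauss(0.0, sigma) for gi in g]
    return y, g


# Toy usage with random genotypes for 500 individuals and 20 SNPs.
rng = random.Random(0)
X = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(20)] for _ in range(500)]
y, g = simulate_phenotype(X, causal=[0, 5, 10], h2=0.5, rng=rng)
print(statistics.pvariance(g) / statistics.pvariance(y))
```

The printed ratio is the realized heritability in the sample, which fluctuates around the target because the noise is random.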

[Figure 1: panels for the single-SNP, 208 kb, and 425 kb resolutions; rows show FDR and Power; horizontal axis: Heritability; legend (Method): KnockoffZoom v2, KnockoffZoom v1.]

FIG. 1. Power and FDR in simulations involving samples with diverse ancestries. KnockoffZoom

performance at different resolutions on artificial phenotypes and real genotypes of 10,000 samples with very

diverse ancestries, using alternative knockoff constructions. The nominal FDR is 10%. The results are

averaged over 10 experiments with independent phenotypes; the vertical bars indicate standard errors.

Supplementary Figure 24 presents analogous results in simulations with sparser signals, while

Supplementary Figure 25 summarizes findings at different resolutions by counting only the most

specific ones.9 Supplementary Figure 26 reports on simulations where SNP importance is estimated

by BOLT-LMM instead of the lasso;50 the LMM is less powerful, but it makes fastPHASE knockoffs

even more susceptible to population structure, while our new method remains valid.

Knockoffs preserving familial relatedness

We test our method on 10,000 British individuals in 4,900 self-reported families; see Supplementary

Table 3 and Supplementary Figure 27 for details. We use RaPID48 to detect IBD segments

wider than 3 cM, chromosome-by-chromosome, adopting the recommended parameters. After discarding,

for simplicity, segments shared by individuals who do not belong to the same self-reported

family, we are left with 723,454 of them. Their mean width is 19.6 Mb, or 26.1 cM, and each

contains 4238 SNPs on average (Supplementary Figure 28). We then generate knockoffs preserving

these IBD segments, and compare the results with those obtained disregarding relatedness.

Supplementary Figure 29 shows that knockoffs would not preserve IBD segments if we did not

explicitly enforce this constraint, especially at low resolution. The diagnostics in Supplementary

Figure 30 confirm that our method correctly preserves LD, and Supplementary Figure 31 demonstrates

that accounting for relatedness does not decrease power; on the contrary, it can increase it

by ensuring that closely related haplotypes are not used as references for one another, which would

reduce the desired contrast between genotypes and knockoffs.

We simulate binary phenotypes from a liability threshold (probit) model with 100 uniformly

distributed causal variants; the numbers of cases and controls are balanced. (We consider binary

phenotypes, as opposed to continuous phenotypes as in the previous section, simply to highlight

the flexibility of our method, which is equally valid regardless of the distribution of the trait.) We

include in this model an additive random term for each family, mimicking shared environmental

effects, whose strength is smoothly controlled by a parameter γ ∈ [0, 1] (Methods). The phenotypes

of different individuals in the same family are conditionally independent given the genotypes if

γ = 0, while identical twins will always have the same phenotype if γ = 1. In theory, environmental

effects may introduce spurious associations, unless the knockoffs account for familial relatedness;

this point is demonstrated empirically below and explained rigorously in the Methods.

Figure 2 reports FDR and power at low resolution, with and without preserving relatedness.

This shows that preserving IBD segments enables FDR control even with extreme environmental

factors (γ = 1), with virtually no power loss. However, KnockoffZoom v2 is reasonably robust even

if relatedness is ignored, especially at higher resolution (Supplementary Figure 32). This partly

depends on the multivariate importance statistics used here (i.e., sparse logistic regression); in fact,

marginal statistics are more vulnerable to confounding, as illustrated in Supplementary Figure 33.

Finally, Supplementary Figures 34–35 confirm that the test statistics for null groups of SNPs are

symmetrically distributed if relatedness is preserved.

[Figure 2: panels for γ = 0, γ = 0.655, and γ = 1; rows show FDR and Power; horizontal axis: Heritability; legend (Relatedness): Preserved, Ignored.]

FIG. 2. Power and FDR in simulations with familial relatedness. KnockoffZoom v2 performance on

artificial phenotypes and real genotypes of 10,000 related samples. Our method is applied with and without

preserving IBD segments. Results for phenotypes with different strengths of environmental effects γ are in

separate columns (γ = 0: no environmental effects, γ = 1: strongest environmental effects; see Methods for

more information about γ). Knockoff resolution equal to 425 kb. Other details are as in Figure 1.

Analysis of the UK Biobank phenotypes

Setup

We analyze 486,975 individuals both genotyped and phased in the UK Biobank, 136,818 of

whom have close relatives; see Methods for details on quality control. Most samples are British

(429,934), Irish (12,702), or other Europeans (16,292). By running RaPID48 within the 57,164

families, as in the simulations, we detect 7,087,643 long IBD segments across the 22 autosomes.

We then apply KnockoffZoom v2 at different resolutions (as in the simulations), aiming to control

the FDR below 10%, using knockoffs preserving both population structure and familial relatedness.

Discoveries with different subsets of individuals

We study 4 continuous traits (height, body mass index, platelet count, systolic blood pressure)

and 4 diseases (cardiovascular disease, respiratory disease, hypothyroidism, diabetes), using different

subsets of samples to compare performance. The phenotypes are defined in Supplementary

Table 4. In order to increase power, we include in the KnockoffZoom predictive model, along with

the genotypes and the knockoffs, a few relevant covariates that help explain some variation in the

phenotype,9 as explained in the Methods. These covariates include the top principal components

of the genetic matrix,14 although it is worth emphasizing that the validity of our method does not

depend on them, since we account for population structure through the knockoffs. The numbers of

low-resolution (208 kb) discoveries are in Figure 3; this demonstrates that including relatives yields

many more discoveries, while including different ancestries tends to have a smaller impact (unsurprisingly,

since we have relatively few non-British samples). Table I summarizes the gains in the

numbers of discoveries at different resolutions allowed by KnockoffZoom v2, either by leveraging

related samples, or by including non-British individuals. This shows an increase in power, except

at the single-SNP resolution; this exception may be partially explained by the fact that single-

SNP discoveries are fewer and thus more affected by the random variability in the knockoffs.9

Supplementary Tables 5–6 report the numbers of discoveries at other resolutions, as well as those

obtained from the analysis of European non-British samples only. The full results are available at

https://msesia.github.io/knockoffzoom-v2/, along with an interactive visualization tool.

Table II demonstrates that KnockoffZoom v2 is much more powerful than BOLT-LMM, when

the latter is applied to the 459k European samples;40 in fact, we discover almost all findings reported

by BOLT-LMM and many new ones. (See Supplementary Tables 7–8 for more detailed results at

other levels of resolution.) This is consistent with KnockoffZoom v1,9 although our method is even

more powerful (except for the single-SNP resolution), as shown in Supplementary Tables 9–10.

Validation of novel discoveries

We begin to validate our findings by comparing them with those in the GWAS Catalog,52

in the Japan Biobank Project,41 and in the FinnGen resource42 (we use the standard 5 × 10⁻⁸

threshold for the p-values reported by the latter two). For simplicity, hereafter we focus on our

findings obtained including both related and non-British individuals. Supplementary Tables 11–12

show that most of our high-resolution discoveries correspond to SNPs previously known to be

[Figure 3: one panel per phenotype (cvd, respiratory, hypothyroidism, diabetes, height, bmi, platelet, sbp); horizontal axis: Population (Everyone, British); vertical axis: Discoveries; legend (Related samples): Included, Excluded.]

FIG. 3. Numbers of KnockoffZoom discoveries for UK Biobank phenotypes. Low-resolution

(208 kb) discoveries using data from subsets of individuals with different self-reported ancestries.

| Resolution | Including related samples, Everyone: Total | Change (%) | Including related samples, British: Total | Change (%) | Including non-British samples, Related: Total | Change (%) | Including non-British samples, Unrelated: Total | Change (%) |
|---|---|---|---|---|---|---|---|---|
| single-SNP | 138–167 | 21.0 | 155–125 | -19.4 | 125–167 | 33.6 | 155–138 | -11.0 |
| 3 kb | 921–992 | 7.7 | 655–971 | 48.2 | 971–992 | 2.2 | 655–921 | 40.6 |
| 20 kb | 2814–3527 | 25.3 | 2808–3355 | 19.5 | 3355–3527 | 5.1 | 2808–2814 | 0.2 |
| 41 kb | 4419–5867 | 32.8 | 4353–5354 | 23.0 | 5354–5867 | 9.6 | 4353–4419 | 1.5 |
| 81 kb | 6784–8031 | 18.4 | 6676–7781 | 16.6 | 7781–8031 | 3.2 | 6676–6784 | 1.6 |
| 208 kb | 8776–10270 | 17.0 | 8635–10049 | 16.4 | 10049–10270 | 2.2 | 8635–8776 | 1.6 |
| 425 kb | 9401–10730 | 14.1 | 9028–10297 | 14.1 | 10297–10730 | 4.2 | 9028–9401 | 4.1 |
| Sample size | 408k–487k | 19.4 | 356k–430k | 20.8 | 430k–487k | 13.0 | 356k–408k | 14.6 |

TABLE I. Effect of sample size increases on numbers of KnockoffZoom v2 discoveries. Cumulative

numbers of discoveries for all UK Biobank phenotypes at different resolutions, with different subsets of the

samples. For example, including related individuals increases by 16.4% the number of discoveries obtained

from the British samples at the 208 kb resolution (from 8635 to 10049). As another example, adding

non-British individuals (including related ones) increases by 2.2% the number of discoveries obtained from the

British samples (including related ones) at the 208 kb resolution (from 10049 to 10270).
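The two examples in the caption of Table I can be reproduced with a short worked computation of the relative change, 100 × (after − before) / before, rounded to one decimal place. The helper name is hypothetical; the counts are those reported in the table at the 208 kb resolution.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent, to one decimal."""
    return round(100.0 * (after - before) / before, 1)


# British samples at 208 kb: adding related individuals (8635 to 10049 discoveries).
related_gain = percent_change(8635, 10049)

# 208 kb, related individuals included: adding non-British samples (10049 to 10270).
ancestry_gain = percent_change(10049, 10270)

print(related_gain, ancestry_gain)
```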

| Phenotype | KnockoffZoom v2: Discoveries | KnockoffZoom v2: Overlap with LMM | BOLT-LMM: Discoveries | BOLT-LMM: Overlap with KZ |
|---|---|---|---|---|
| bmi | 2395 | 898 (37.5%) | 697 | 689 (98.9%) |
| cvd | 940 | 274 (29.1%) | 257 | 249 (96.9%) |
| diabetes | 113 | 52 (46.0%) | 62 | 55 (88.7%) |
| height | 3339 | 2228 (66.7%) | 2464 | 2430 (98.6%) |
| hypothyroidism | 295 | 129 (43.7%) | 143 | 142 (99.3%) |
| platelet | 1743 | 1057 (60.6%) | 1204 | 1183 (98.3%) |
| respiratory | 262 | 82 (31.3%) | 94 | 92 (97.9%) |
| sbp | 1183 | 561 (47.4%) | 568 | 530 (93.3%) |

TABLE II. Comparison of KnockoffZoom v2 at low resolution with BOLT-LMM. KnockoffZoom

discoveries (208 kb resolution, 10% FDR) using all 487k UK Biobank samples for different phenotypes,

and corresponding BOLT-LMM genome-wide significant discoveries (5 × 10⁻⁸); the latter is applied on 459k

European samples40 for all phenotypes except diabetes and respiratory disease, for which it is applied on

350k unrelated British samples,9 for the sake of consistency in phenotype definitions. For example, we report

940 distinct discoveries for cardiovascular disease, 274 of which contain significant associations according to

BOLT-LMM. The latter reports a total of 257 discoveries (clumped9 with the standard PLINK51 algorithm)

for this phenotype, 96.9% of which overlap with one of our discoveries.

associated with the phenotype of interest. This is particularly true for those findings that are

also detected by BOLT-LMM, although many of our additional discoveries are also confirmed; see

Supplementary Tables 13–14. For example, our method reports 1089 findings for cardiovascular

disease at the 425 kb resolution, only 255 of which can be detected by BOLT-LMM; however, 85.6%

of our additional 834 discoveries are confirmed in at least one of the aforementioned resources.

Furthermore, Supplementary Table 15 suggests that most relevant associations in the Catalog

(above 70%) are confirmed by our findings, which is again indicative of high power. The relative

power of our method (i.e., the proportion of previously reported associations that we discover) seems

to be above 90% for quantitative traits, but lower than 50% for all diseases except hypothyroidism,

probably due to the relatively small number of cases in the UK Biobank data set compared to

more targeted case-control studies.

The 5×10−8 genome-wide threshold for the Japan Biobank Project and the FinnGen resource is

overly conservative given that our goal is to confirm selected discoveries. Therefore, we next utilize

these independent summary statistics for an enrichment analysis. The idea is to compare the

distribution of the external statistics corresponding to our selected loci to that of loci from the rest

| Phenotype | Total: Discoveries | Total: Confirmed (Other) | Total: Confirmed (Other or Enrich.) | Not found by BOLT-LMM: Discoveries | Not found by BOLT-LMM: Confirmed (Other) | Not found by BOLT-LMM: Confirmed (Other or Enrich.) |
|---|---|---|---|---|---|---|
| bmi | 2395 | 1076 (44.9%) | 1620 (67.6%) | 1497 | 335 (22.4%) | 806 (53.8%) |
| cvd | 940 | 738 (78.5%) | 764 (81.3%) | 666 | 472 (70.9%) | 493 (74.0%) |
| diabetes | 113 | 97 (85.8%) | 106 (93.8%) | 61 | 46 (75.4%) | 54 (88.5%) |
| height | 3339 | 1886 (56.5%) | 2493 (74.7%) | 1111 | 164 (14.8%) | 556 (50.0%) |
| hypothyroidism | 295 | 156 (52.9%) | 226 (76.6%) | 166 | 43 (25.9%) | 101 (60.8%) |
| platelet | 1743 | 453 (26.0%) | 1017 (58.3%) | 686 | 29 (4.2%) | 256 (37.3%) |
| respiratory | 262 | 241 (92.0%) | NA | 180 | 159 (88.3%) | NA |
| sbp | 1183 | 643 (54.4%) | 885 (74.8%) | 622 | 154 (24.8%) | 358 (57.6%) |

TABLE III. Validation of findings through comparisons with other studies and enrichment.

Numbers of low-resolution (208 kb) discoveries obtained with our method and confirmed by other studies,

or by an enrichment analysis carried out on external summary statistics. For example, 81.3% of our 940

discoveries for cardiovascular disease are confirmed either by the results of other studies, or by the enrichment

analysis. The results are stratified based on whether our findings can be detected by BOLT-LMM using the

UK Biobank data (excluding non-European individuals).

of the genome, as explained in Supplementary Note D. This approach can estimate the number of

replicated discoveries but it has the limitation that it cannot tell exactly which ones are confirmed;

therefore, we will consider alternative validation methods later. (A more precise analysis is possible

here in theory, but has low power; see Supplementary Note D). Supplementary Tables 16–17

show that many additional discoveries can thus be validated, especially at high resolution. (See

Supplementary Tables 18–19 for more details about enrichment.) Table III summarizes these

confirmatory results. Respiratory disease is excluded from the enrichment analysis because the

FinnGen resource divides it among several fields, so it is unclear how to best obtain a single p-

value. In any case, the GWAS Catalog and the FinnGen resource already directly validate 90% of

our new findings for this phenotype.

We continue the validation by inspecting the novel discoveries (i.e., those missed by BOLT-

LMM and unconfirmed by the above studies) and cross-referencing them with the genetics litera-

ture, focusing for simplicity on the 20 kb resolution. Supplementary Table 20 shows that almost

all discoveries contain genes, and most have known associations to phenotypes closely related to

that of interest (Supplementary Table 21). Furthermore, most lead SNPs (those with the largest

importance measure in each group9) have functional annotations (Supplementary Table 22).

Figure 4 showcases one of our novel discoveries for cardiovascular disease. The finest finding

here spans 4 genes, but we could not find previously reported associations with cardiovascular

disease within this locus. However, one of these genes (SH3TC2) is known to be associated with

blood pressure,53 while another (ABLIM3) is known to be associated with body mass index.54

[Figure 4. Top panel: Manhattan plot (BOLT-LMM), vertical axis −log10(p). Center panel: Chicago plot (KnockoffZoom v2), vertical axis resolution (kb), from single-SNP to 425. Bottom panel: genes (ABLIM3, SH3TC2, LOC255187, MIR584). Horizontal axis: chromosome 5 (Mb), 148.2 to 148.8.]

FIG. 4. Novel discovery for cardiovascular disease on the UK Biobank data. Center: the shaded

rectangles indicate the genetic segments detected by our method at different resolutions. Bottom: genes in

the locus spanned by our finest discovery. Top: BOLT-LMM marginal p-values computed on UK Biobank

samples with European ancestry,40 for genotyped and imputed variants within this locus. All BOLT-LMM p-

values within this locus are larger than 5×10−8; those larger than 10−3 are hidden. Note that KnockoffZoom

does not analyze imputed variants, since these do not carry any additional information about the phenotype.9

DISCUSSION

This paper has developed a new algorithm for constructing genetic knockoffs, inspired by the

SHAPEIT phasing method, that extends the applicability of KnockoffZoom to the analysis of data

with population structure. In particular, we can include related and ethnically diverse individuals

while controlling the FDR. This extension is crucial for several reasons. Firstly, very large studies

are sampling entire populations quite densely,42,55 which yields many closely related samples. It

would be wasteful to discard this information, and potentially dangerous not to carefully account for

relatedness. Secondly, the historical lack of diversity in GWAS (which mostly involve individuals

of European ancestry) is a well-recognized problem,56,57 which biases our scientific knowledge

and disadvantages the health of under-represented populations. While this issue goes beyond the

statistical difficulty of analyzing diverse GWAS data, our work should at least help remove a

technical barrier. Firstly, by allowing the simultaneous analysis of diverse populations, we can borrow strength across them and increase power, as the different patterns of LD in different populations may uncover causal variants more effectively.58 Secondly, discoveries obtained from the analysis of diverse data may improve our ability to explain phenotypic variation in the minority populations.59,60 Since the UK Biobank mostly comprises British individuals, the increase in

power resulting from the analysis of diverse samples can only be relatively small. Nonetheless,

we observe some gains when we include non-British individuals. This is promising, especially

since simulations demonstrate that our inferences are valid even when the population is extremely

heterogeneous. Therefore, in the near future, we would like to apply our method to more diverse

data sets, such as that collected by the Million Veteran Program,61 for example.

Our method accounts for LD, population structure, and cryptic relatedness, which are the major

sources of confounding in GWAS data, so our discoveries are directly interpretable and may be

portrayed in a causal sense relatively safely;5,9 indeed, a closely related approach yields formal

causal claims in the special case of parent-child trio data.62 Furthermore, our inferences require

no assumptions about the genetic architecture of the phenotype, which makes KnockoffZoom very

versatile. It is worth stressing that our simulations are based on genetic data, do not provide any

additional information to our method about the relatedness or population structure other than

that already available to the analyst, and do not exploit any knowledge of the architecture of the

simulated traits. Therefore, the results are informative about the general validity of our method.5,9

Confirmatory analyses demonstrate that a large proportion of our discoveries are either con-

sistent with the findings of BOLT-LMM on the same data, or are supported by other studies.

This is of interest, especially since we do not have access to studies larger than the UK Biobank,

so we cannot expect to already replicate all novel discoveries. Furthermore, many of our unconfirmed findings are associated with related phenotypes or contain protein-coding genes that are

over-expressed in the relevant tissues. Even though our method does not perform fine-mapping in

the traditional sense—we do not work with imputed variants—it is more flexible and appreciably

more powerful than the existing genome-wide alternatives, such as BOLT-LMM. Furthermore, our

discoveries are more tightly localized than those based on marginal testing, and hence immediately

more interpretable, which will facilitate any follow-up analysis.

The possibility to include individuals of diverse ancestries opens alluring research opportunities.

For example, we would like to understand which discoveries are consistent across populations

and which are more specific, as this may help further weed out false positives, explain observed

variations in phenotypes, and possibly shed more light onto the underlying biology.

SOFTWARE AVAILABILITY

We have implemented our methods in standalone software written in C++, which is available

from https://msesia.github.io/knockoffzoom-v2/. This takes as input phased haplotypes in

BGEN format63 and outputs genotype knockoffs at the desired resolution in the PLINK51 BED

format. Our software is designed for the analysis of large data: it is multi-threaded and memory

efficient. Furthermore, if sufficient computational resources are available, knockoffs for different

chromosomes can be constructed in parallel. For reference, it took us approximately 4 days using

10 cores and 80GB of memory to generate knockoffs for the UK Biobank data on chromosome 1

(approximately 1M haplotype sequences, 600k SNPs, and 600k IBD segments).

DATA AVAILABILITY

This research used data from the UK Biobank Resource (application 27837); see https://www.ukbiobank.ac.uk/.

ACKNOWLEDGEMENTS

M. S. was in the Department of Statistics at Stanford University. M. S., S. B., E. C. and C. S.

were supported by NSF grant DMS 1712800. S. B. was also supported by a Ric Weiland fellowship.

E. C. and C. S. were also supported by NSF grant OAC 1934578 and by a Math+X grant (Simons

Foundation). We thank Kevin Sharp (University of Oxford) for sharing useful computer code. The

authors acknowledge the participants and investigators of the UK Biobank project, the FinnGen

study, and the Japan Biobank Project.

1 Lander, E. & Kruglyak, L. Genetic dissection of complex traits: guidelines for interpreting and reporting

linkage results. Nat. Genet. 11, 241–247 (1995).

2 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci.

U.S.A 100, 9440–9445 (2003).

3 Sabatti, C., Service, S. & Freimer, N. False discovery rate in linkage and association genome screens for

complex disorders. Genetics 164, 829–833 (2003).

4 Brzyski, D. et al. Controlling the rate of GWAS false discoveries. Genetics 205, 61–75 (2017).

5 Sesia, M., Sabatti, C. & Candes, E. Gene hunting with hidden Markov model knockoffs. Biometrika

106, 1–18 (2019).

6 Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).

7 Hill, W. & Robertson, A. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231

(1968).

8 Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using

single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).

9 Sesia, M., Katsevich, E., Bates, S., Candes, E. & Sabatti, C. Multi-resolution localization of causal

variants across the genome. Nat. Comm. 11, 1093 (2020).

10 Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484

(2019).

11 Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

12 Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. Association mapping in structured

populations. Am. J. Hum. Genet. 67, 170–181 (2000).

13 Sul, J. H., Martin, L. S. & Eskin, E. Population structure in genetic studies: Confounding factors and

mixed models. PLoS Genet. 14, e1007309 (2018).

14 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association

studies. Nat. Genet. 38, 904–909 (2006).

15 Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of

relatedness. Nat. Genet. 38, 203–208 (2006).

16 Kang, H. M. et al. Efficient control of population structure in model organism association mapping.

Genetics 178, 1709–1723 (2008).

17 Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association

studies. Nat. Genet. 42, 348–354 (2010).

18 Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet.

11, 499–511 (2010).

19 Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants

by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).

20 Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in

regression, with application to genetic fine mapping. J. R. Stat. Soc. B. (2020).

21 Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci

with multiple signals of association. Genetics 198, 497–508 (2014).

22 Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach

to multiple testing. J. R. Stat. Soc. B. 57, 289–300 (1995).

23 Candes, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: Model-X knockoffs for high-dimensional

controlled variable selection. J. R. Stat. Soc. B. 80, 551–577 (2018).

24 Barber, R. F. & Candes, E. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085

(2015).

25 Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype

data. Genetics 155, 945–959 (2000).

26 Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype

data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).

27 Koivisto, M. et al. An MDL method for finding haplotype blocks and for estimating the strength of

haplotype block boundaries. In Biocomputing 2003, 502–513 (World Scientific, 2002).

28 Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nat.

Rev. Genet. 12, 703 (2011).

29 Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data:

applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644

(2006).

30 Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide

association studies by imputation of genotypes. Nat. Genet. 39, 906 (2007).

31 Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the

next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).

32 Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. MaCH: using sequence and genotype data

to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).

33 Weir, B. S. Genetic Data Analysis (Sinauer, Sunderland, Massachusetts, 1990).

34 Stephens, J. C., Briscoe, D. & O’Brien, S. J. Mapping by admixture linkage disequilibrium in human

populations: limits and guidelines. Am. J. Hum. Genet. 55, 809 (1994).

35 Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of genomes.

Nat. Methods 9, 179 (2012).

36 Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and popu-

lation genetic studies. Nat. Methods 10, 5 (2013).

37 O’Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 48, 817 (2016).

38 Thompson, E. A. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics

194, 301–326 (2013).

39 Bycroft, C. et al. The UK biobank resource with deep phenotyping and genomic data. Nature 562,

203–209 (2018).

40 Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-

scale datasets. Nat. Genet. 50, 906–908 (2018).

41 Biobank Japan. Biobank Japan Project (2020). URL http://jenger.riken.jp/en/.

42 FinnGen. FinnGen documentation of r3 release (2020). URL https://finngen.gitbook.io/documentation/.

43 Sved, J. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor.

Popul. Biol. 2, 125–141 (1971).

44 Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat.

Genet. 40, 1068 (2008).

45 Gusev, A. et al. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 19,

318–326 (2009).

46 Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. Am. J.

Hum. Genet. 88, 173–182 (2011).

47 Bjelland, D. W., Lingala, U., Patel, P. S., Jones, M. & Keller, M. C. A fast and accurate method for

detection of IBD shared haplotypes in genome-wide SNP data. Eur. J. Hum. Genet. 25, 617–624 (2017).

48 Naseri, A., Liu, X., Tang, K., Zhang, S. & Zhi, D. RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts. Genome Biol. 20, 143 (2019).

49 Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society:

Series B (Methodological) 58, 267–288 (1996).

50 Marchini, J. L. Discussion of gene hunting with hidden Markov model knockoffs. Biometrika 106, 27–28

(2019).

51 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Am. J. Hum. Genet. 81, 559–575 (2007).

52 Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies,

targeted arrays and summary statistics 2019. Nucleic acids research 47, D1005–D1012 (2019).

53 Hoffmann, T. J. et al. Genome-wide association analyses using electronic health records identify new

loci influencing blood pressure variation. Nature Genet. 49, 54 (2017).

54 Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature

518, 197–206 (2015).

55 deCODE genetics. https://www.decode.com/ (2019). Accessed: 2019-12-06.

56 Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature News 538, 161 (2016).

57 Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177,

26–31 (2019).

58 Rosenberg, N. A. et al. Genome-wide association studies in diverse populations. Nat. Rev. Genet. 11,

356–366 (2010).

59 Bitarello, B. D. & Mathieson, I. Polygenic scores for height in admixed populations. bioRxiv preprint

(2020).

60 Cavazos, T. B. & Witte, J. S. Inclusion of variants discovered from diverse populations improves polygenic

risk score transferability. bioRxiv preprint (2020).

61 Gaziano, J. M. et al. Million Veteran Program: A mega-biobank to study genetic influences on health

and disease. J. Clin. Epidemiol. 70, 214–223 (2016).

62 Bates, S., Sesia, M., Sabatti, C. & Candes, E. J. Causal inference in genetic trio studies. arXiv preprint

2002.09644 (2020).

63 Band, G. & Marchini, J. BGEN: a binary file format for imputed genotype and haplotype data. BioRxiv

308296 (2018).

64 Sesia, M., Sabatti, C. & Candes, E. Rejoinder: Gene hunting with hidden Markov model knockoffs.

Biometrika 106, 35–45 (2019).

65 Sabatti, C. Multivariate Linear Models for GWAS, 188–207 (Cambridge University Press, 2013).

66 Privé, F., Aschard, H., Ziyatdinov, A. & Blum, M. G. B. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787 (2018).

67 International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations.

Nature 467, 52 (2010).

68 Kinderman, R. & Snell, S. Markov random fields and their applications (American Mathematical Society,

Providence, RI, USA, 1980).

69 Yedidia, J., Freeman, W. & Weiss, Y. Understanding belief propagation and its generalizations. In

Exploring Artificial Intelligence in the New Millennium, vol. 8, 239–269 (Morgan Kaufmann Publishers

Inc., San Francisco, CA, USA, 2003).

70 Bates, S., Candes, E., Janson, L. & Wang, W. Metropolized knockoff sampling. J. Am. Stat. Assoc.

1–25 (2020).

METHODS

Formal definition of the inferential objective

We begin by formally stating our objective. Let $Y \in \mathbb{R}^n$ be a phenotype of interest measured for $n$ subjects, and let $X \in \mathbb{R}^{n \times p}$ be the corresponding genotypes at $p$ sites, which are assumed to be random variables from an HMM, as we will discuss shortly. Our goal is to detect genomic regions containing distinct associations with the phenotype, as precisely as possible. Formally, we pursue this objective by testing conditional null hypotheses23 of the form

$$H_{0,g} : Y \perp\!\!\!\perp X_{G_g} \mid X_{-G_g}, \tag{1}$$

where $G = (G_1, \ldots, G_L)$ is a fixed partition of $\{1, \ldots, p\}$, $X_{G_g}$ denotes the variants in group $G_g$, and $X_{-G_g}$ denotes those outside it. This framing is different from that employed by classical GWAS techniques, where one is only able to test marginal independence between the phenotype and a single variant $X_j$, which is a scientifically less interesting hypothesis.5,64 In particular, conditioning

on the remainder of the genome ($X_{-G_g}$ in the above notation) ensures that a rejection of the null hypothesis implies the presence of a unique association of $Y$ with the variants in the region $G_g$,

rather than a spurious correlation caused by LD. Moreover, such tests are naturally robust to the

confounding of population structure,64 and even enable formal causal inferences in some cases.62

While we do not have the space for a full account here, previous work has already discussed at

length the relative advantages of conditional testing over marginal testing.5,9,62,64 The only caveat is

that correctly implementing these tests using GWAS data is technically challenging in the presence

of population structure9—a difficulty that we address here by building upon KnockoffZoom.9

Review of KnockoffZoom

KnockoffZoom is a recently introduced statistical technique for genome-wide association studies that simultaneously tests the conditional null hypotheses defined above for all groups of variants in any given partition of the genome, controlling the FDR (the expected proportion of false discoveries)

below a desired target level (e.g., 10%). By applying this procedure multiple times with increas-

ingly refined partitions, one can localize interesting conditional associations at different levels of

resolution, ranging from a few hundred kilo-bases to single-SNP precision. Unlike those of other

approaches, the statistical guarantees of KnockoffZoom do not rely on any assumption about the

genetic architecture of the trait, such as linearity and Gaussian errors. Instead, KnockoffZoom only

needs to model the distribution of the genotypes with an HMM, consistently with the standard

approaches for phasing and imputation.5,9 To test the hypotheses in (1), KnockoffZoom leverages

a statistical technology called knockoffs,5,23,24 which we describe informally below.
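For reference, the FDR mentioned above can be written explicitly. The following uses the standard definition of Benjamini and Hochberg;22 the symbols $\hat{S}$ (the set of groups reported as discoveries) and $\mathcal{H}_0$ (the set of groups for which the null hypothesis in (1) is true) are notation introduced here for illustration only.

```latex
\mathrm{FDR} = \mathbb{E}\left[ \frac{|\hat{S} \cap \mathcal{H}_0|}{\max(1, |\hat{S}|)} \right]
```

Controlling the FDR at the 10% level therefore means that, on average over repeated experiments, at most 10% of the reported groups are false discoveries.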

For the genotypes $X^{(i)}$ of any individual $i$, we construct in silico a synthetic "knockoff" version $\tilde{X}^{(i)} \in \mathbb{R}^p$, in such a way that $X$ and $\tilde{X}$ are statistically exchangeable at the population level, except for the fact that $\tilde{X}$ has no conditional associations with $Y$. In other words, for any group $G_g$ in the given genome partition, the distributions of $X_{G_g}$ and $\tilde{X}_{G_g}$ are indistinguishable unless there is a conditional association between $X_{G_g}$ and $Y$. This property, which will be formally stated later, allows the knockoff genotypes $\tilde{X}$ to serve as negative controls for $X$.23 The intuition is that, by construction of the knockoffs, any detectable difference between $X_{G_g}$ and $\tilde{X}_{G_g}$ implies that the region $G_g$ must contain a distinct association with $Y$. Practically, one can control the FDR for the conditional hypotheses in (1) by computing a test statistic for each group of variants (i.e., as the contrast between any empirical association measures between $Y$ and $X_{G_g}$, $\tilde{X}_{G_g}$, respectively) and then rejecting the null hypotheses whose statistics are greater than the data-adaptive significance

threshold computed by the knockoff filter.24 To maximize the number of discoveries, one should use

powerful association (or, importance) measures; the typical solution is to compute these starting

from a multivariate predictive model, as explained next.23
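The data-adaptive threshold of the knockoff filter24 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `knockoff_threshold` and `knockoff_select` are hypothetical, and the default `offset=1` corresponds to the "knockoff+" variant of the filter, which is the one associated with FDR control.

```python
import numpy as np


def knockoff_threshold(W, fdr=0.10, offset=1):
    """Smallest t > 0 such that the estimated proportion of false
    discoveries among {g : W_g >= t} is at most `fdr`.

    The estimate is (offset + #{g : W_g <= -t}) / max(1, #{g : W_g >= t}).
    Returns +inf if no candidate threshold qualifies (no discoveries).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct non-zero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        estimated_false = offset + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if estimated_false / n_selected <= fdr:
            return float(t)
    return np.inf


def knockoff_select(W, fdr=0.10, offset=1):
    """Indices of the groups whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_threshold(W, fdr=fdr, offset=offset)
    return np.flatnonzero(W >= tau)
```

Negative statistics act as a count of likely false positives: a group whose knockoff looks more important than the real genotypes is evidence that an equally large positive statistic could have arisen by chance.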

KnockoffZoom9 fits a sparse generalized linear regression model of $Y$ given the augmented data $[X, \tilde{X}] \in \mathbb{R}^{n \times 2p}$, after standardizing $X$ and $\tilde{X}$ to have unit variance. Then, letting $\hat{\beta}_j(\lambda_{\mathrm{CV}})$ and $\hat{\beta}_{j+p}(\lambda_{\mathrm{CV}})$ indicate the estimated coefficients for $X_j$ and $\tilde{X}_j$, respectively, it defines the importance measures $T_g = \sum_{j \in G_g} |\hat{\beta}_j(\lambda_{\mathrm{CV}})|$ and $\tilde{T}_g = \sum_{j \in G_g} |\hat{\beta}_{j+p}(\lambda_{\mathrm{CV}})|$ for each group of variants $G_g$. (The regularization parameter $\lambda_{\mathrm{CV}}$ is tuned by cross-validation.) We adopt these statistics because they have the advantage of being powerful,9 interpretable for GWAS,65 and computationally feasible even with very large data.66 However, the knockoffs framework can theoretically accommodate virtually any statistic.23 The importance measures are combined into a test statistic for each group of variants, i.e., $W_g = T_g - \tilde{T}_g$, which is finally provided as input to the knockoff filter24 to select significant associations. In particular, the knockoff filter selects groups of SNPs whose $W$ is sufficiently large (i.e., here large values of $W$ imply that the corresponding real genotypes have

larger regression coefficients compared to the knockoffs). See Figure 5 for a schematic summary of

the entire procedure.
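The group-level statistics described above can be computed directly from a fitted coefficient vector. The sketch below assumes the 2p coefficients have already been estimated by some sparse regression routine (not shown), with the first p entries belonging to the real variants and the last p to their knockoffs; the function name `group_statistics` and the integer encoding of the partition are illustrative choices, not part of the KnockoffZoom software.

```python
import numpy as np


def group_statistics(beta, groups):
    """Return W_g = T_g - T~_g for every group in the partition.

    beta   : array of length 2p; beta[:p] are the coefficients of the real
             variants and beta[p:] those of the corresponding knockoffs.
    groups : integer array of length p; groups[j] is the index g (0-based)
             of the group containing variant j.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups, dtype=int)
    p = groups.shape[0]
    if beta.shape[0] != 2 * p:
        raise ValueError("beta must have length 2 * len(groups)")
    n_groups = int(groups.max()) + 1
    # Sum of absolute coefficients within each group, real and knockoff.
    T = np.bincount(groups, weights=np.abs(beta[:p]), minlength=n_groups)
    T_tilde = np.bincount(groups, weights=np.abs(beta[p:]), minlength=n_groups)
    return T - T_tilde
```

For example, with four variants split into two groups, `group_statistics([1.0, -2.0, 0.0, 0.5, 0.0, 0.5, 1.0, 0.0], [0, 0, 1, 1])` gives a positive statistic for the first group (the real variants dominate) and a negative one for the second.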

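[Figure 5. Workflow diagram with stages labeled Data, Data augmentation, Predictive system, and Calibration. The genotypes X are phased into haplotypes H; the knockoff algorithm, based on the haplotype model (novel component), generates knockoff haplotypes, which are de-phased into knockoff genotypes; the augmented data (X, X̃), the phenotype Y, and the genome partition enter a predictive model that yields test statistics; the knockoff filter, given the FDR level, returns the discoveries.]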

FIG. 5. KnockoffZoom workflow. The novelty consists of an HMM for the distribution of haplotypes,

H, that can account for population structure and familial relatedness as well as LD, and of the associated

algorithm for generating knockoffs. For computational reasons, the genotypes are phased prior to generating

knockoffs, and the knockoff haplotypes are then de-phased to obtain knockoff genotypes.9

KnockoffZoom model setup

KnockoffZoom v15 assumes all samples to be independent and identically distributed (i.i.d.):

$$(X^{(i)}, Y^{(i)}) \overset{\text{i.i.d.}}{\sim} P_{X,Y} = P_X \cdot P_{Y \mid X},$$

where the distribution of the genotypes, $P_X$, is approximated by the fastPHASE HMM,29 and $P_{Y \mid X}$ is left completely unspecified.23 The fastPHASE HMM restricts the applicability of KnockoffZoom to homogeneous samples9 (see also Supplementary Notes A–C), while the i.i.d. assumption is naturally ill-suited to describe closely related samples that, in addition to sharing long nearly identical portions of DNA, may also be exposed to the same environmental factors affecting their

phenotypes. Our present work shows how to overcome these limitations. We generalize the above

setup by assuming a fixed family structure and by grouping together individuals in the same family, who are no longer treated as independent. Furthermore, the distribution of the genotypes

is allowed to vary across families, depending on the population from which the family descended.

Formally, let $F = \{F_1, \ldots, F_{|F|}\}$ indicate a fixed partition of $\{1, \ldots, n\}$, where $F_f \subseteq \{1, \ldots, n\}$ for all $f \in \{1, \ldots, |F|\}$. Then, we assume that genotypes and phenotypes, along with a shared family effect $E$, are sampled independently for individuals in different families, from some joint distribution:

$$(X^{(F_f)}, Y^{(F_f)}, E_f) \sim P^f_{X,Y,E} = P^f_E \cdot P^f_X \cdot P^f_{Y \mid X,E}. \tag{2}$$

Above, the random matrix $X^{(F_f)} \in \mathbb{R}^{|F_f| \times p}$ contains the genotypes (i.e., the rows of $X$) for all individuals in the $f$th family and is sampled jointly from $P^f_X$, the random vector $Y^{(F_f)} \in \mathbb{R}^{|F_f|}$ describes the corresponding phenotypes, and $E_f$ is the environmental factor shared by all family members. Crucially, these distributions are allowed to vary across families, which encodes the fact that the distribution of genotypes and phenotypes varies across populations. Importantly, the distributions $P^f_E$ and $P^f_{Y \mid X,E}$ are not assumed to be known; only $P^f_X$ must be specified to carry out the procedure. Conditional on $X^{(F_f)}$ and $E_f$, the phenotypes $Y^{(F_f)}$ are assumed to be sampled independently for each individual:

$$p(Y^{(F_f)} \mid X^{(F_f)}, E_f) = \prod_{i \in F_f} p^f(Y^{(i)} \mid X^{(i)}, E_f),$$

for some distribution function $p^f$, which may also vary across families. Lastly, note that $X \perp\!\!\!\perp E$, which means $E$ is assumed to be an environmental factor unrelated to the genetic effects.
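To make the family model concrete, the following sketch simulates phenotypes with a shared family effect. It is only one possible instance of the setup: the text leaves the conditional distribution of the phenotype completely unspecified, whereas this example assumes a linear model with Gaussian family effects and Gaussian noise. The function name `simulate_family_phenotypes` and all of its parameters are hypothetical.

```python
import numpy as np


def simulate_family_phenotypes(X, families, beta,
                               family_sd=1.0, noise_sd=1.0, seed=0):
    """Simulate Y = X beta + E_f + noise (an assumed, illustrative model).

    X        : (n, p) genotype matrix.
    families : integer array of length n; families[i] is the index f
               (0-based) of the family containing individual i.
    beta     : array of length p with the (assumed) genetic effects.
    Returns the phenotypes Y (length n) and the family effects E.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    families = np.asarray(families, dtype=int)
    beta = np.asarray(beta, dtype=float)
    n_families = int(families.max()) + 1
    # One environmental effect per family, drawn independently of X.
    E = rng.normal(0.0, family_sd, size=n_families)
    # Individual noise, independent across family members given X and E.
    noise = rng.normal(0.0, noise_sd, size=X.shape[0])
    Y = X @ beta + E[families] + noise
    return Y, E
```

Because `E` is drawn without reference to `X`, the simulation respects the assumption that the family effect is independent of the genotypes, while members of the same family still have correlated phenotypes through the shared term `E[families]`.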

Before proceeding to the mechanics of the test, we pause to explain why testing the hypotheses

in (1) properly already implicitly accounts for the family effects E. Since our motivation is to help

build a better genetic understanding of the phenotype, the most intuitive goal within the above setup is to test the null hypotheses

$$H_{0,g} : Y \perp\!\!\!\perp X_{G_g} \mid X_{-G_g}, E,$$

for all $g \in \{1, \ldots, L\}$. However, these hypotheses are equivalent to those in (1) because we have assumed $X \perp\!\!\!\perp E$. Therefore, tests of (1) already account for family effects, as long as we can carry

them out correctly, which is the problem we address here.

The last preliminary step is to formally define the knockoffs. This requires a small extension

of the existing framework,23 because there are now dependent groups of observations (families)

that follow different distributions. Nonetheless, the existing theory can be easily adapted for this

setting. In particular, we can still test the hypotheses in (1) with the knockoff filter,24 as long as

the KnockoffZoom test statistics satisfy the flip-sign property in Lemma 3.3 of Candes et al.23 To

ensure this property holds in the presence of families, the knockoff exchangeability property23 must

be defined in the following way: for any fixed partition G of the variants, we require that swapping

any group of SNPs with the corresponding knockoffs, simultaneously for all family members, would

keep the joint distribution of genotypes and knockoffs statistically indistinguishable. Formally, this

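can be written as:

$$\left(X^{(F_f)}, \tilde{X}^{(F_f)}\right)_{\mathrm{swap}(S;\mathcal{G})} \overset{d}{=} \left(X^{(F_f)}, \tilde{X}^{(F_f)}\right), \quad \forall S \subseteq \mathcal{G}, \tag{3}$$

where $F_f$ indicates a particular family, and $\mathrm{swap}(S;\mathcal{G})$ is the operator that swaps all columns of $X^{(F_f)}$ indexed by $S$ with the corresponding columns of $\tilde{X}^{(F_f)}$. In the following, we will develop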

a novel algorithm, based on a joint model for the genotypes in each family, to generate knockoff

genotypes satisfying (3).
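To make the swap operator in (3) concrete, here is a minimal sketch in plain Python. The function name `swap`, the list-of-lists representation of the genotype matrices, and the toy values are illustrative assumptions, not part of the authors' software; the only point is that the selected groups of columns are exchanged simultaneously for all family members.

```python
def swap(X, X_knock, S, groups):
    """Swap the columns of X and X_knock that belong to the groups listed in S.

    X, X_knock: genotype and knockoff matrices of one family (rows = members).
    groups: list of lists of column indices (the partition of the variants).
    S: indices of the groups to swap.
    """
    cols = {j for g in S for j in groups[g]}
    X_new = [[xk[j] if j in cols else x[j] for j in range(len(x))]
             for x, xk in zip(X, X_knock)]
    Xk_new = [[x[j] if j in cols else xk[j] for j in range(len(x))]
              for x, xk in zip(X, X_knock)]
    return X_new, Xk_new

# Toy family with two members and three variants, grouped as {0, 1} and {2}.
X = [[0, 1, 2], [1, 1, 0]]
X_knock = [[2, 0, 1], [0, 2, 1]]
groups = [[0, 1], [2]]
X_s, Xk_s = swap(X, X_knock, S=[0], groups=groups)
```

Property (3) is a statement about distributions, so it cannot be checked on a single draw; the sketch only shows which entries move when a group is swapped.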

Modeling haplotypes with population structure

We begin by recalling some useful notation for HMMs.9 We say that a sequence of phased

haplotypes H = (H1, . . . ,Hp), with Hj ∈ {0, 1}, is distributed as an HMM with K hidden states if

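there exists a vector of latent random variables $Z = (Z_1, \ldots, Z_p)$, with $Z_j \in \{1, \ldots, K\}$, such that:

$$\begin{cases} Z \sim \mathrm{MC}(Q), & \text{(latent discrete Markov chain)}, \\ H_j \mid Z \sim H_j \mid Z_j \overset{\text{ind.}}{\sim} f_j(H_j \mid Z_j), & \text{(emission distribution)}. \end{cases} \tag{4}$$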

Above, the Markov chain MC (Q) has initial probabilities Q1 and transition matrices (Q2, . . . , Qp).

Taking inspiration from SHAPEIT,35–37 we assume the $i$-th haplotype sequence can be approximated as an imperfect mosaic of $K$ other haplotypes in the data set, indexed by $\{\sigma_{i1}, \ldots, \sigma_{iK}\} \subseteq \{1, \ldots, 2n\} \setminus \{i\}$. (Note the slight overload of notation: $i$ denotes hereafter a phased haplotype sequence, two of which are available for each individual). We will discuss later how the references are determined; for now, we take them as fixed and describe the other aspects of the model. Mathematically, the mosaic is described by an HMM in the form of (4), where the distribution of the latent Markov chain is:

$$P[Z^{(i)}_1 = k] = \alpha^{(i)}_k, \qquad P[Z^{(i)}_j = k' \mid Z^{(i)}_{j-1} = k] = Q^{(i)}_j(k' \mid k) = \begin{cases} (1 - e^{-\rho d_j})\,\alpha^{(i)}_{k'} + e^{-\rho d_j}, & \text{if } k' = k, \\ (1 - e^{-\rho d_j})\,\alpha^{(i)}_{k'}, & \text{if } k' \neq k. \end{cases} \tag{5}$$

Above, $d_j$ indicates the genetic distance between loci $j$ and $j-1$, which is assumed to be fixed and known (in practice, we will use distances previously estimated in a European population,67 although our method could easily accommodate different distances for different populations). The parameter $\rho > 0$ controls the rate of recombination and can be estimated with an expectation-maximization (EM) technique (Supplementary Methods). However, we have observed it works well with our data to simply set $\rho = 1$; this is consistent with the approach of SHAPEIT,35–37 which also uses fixed parameters. The positive weights $(\alpha^{(i)}_1, \ldots, \alpha^{(i)}_K)$ are normalized so that their sum equals one and they can be interpreted as characterizing the ancestry of the $i$-th individual. In the applications discussed within this paper, we simply assume $\alpha^{(i)}_k = 1/K$ for all $k$, although these parameters could also be estimated by EM (Supplementary Methods). Conditional on the sequence $Z$, each element of $H$ follows an independent Bernoulli distribution:

$$f^{(i)}_j(H^{(i)}_j \mid k) = \begin{cases} 1 - \lambda_j, & \text{if } H^{(i)}_j = H^{(\sigma_{ik})}_j, \\ \lambda_j, & \text{if } H^{(i)}_j \neq H^{(\sigma_{ik})}_j. \end{cases} \tag{6}$$

Above, the parameter λj is a site-specific mutation rate, which makes the HMM mosaic imperfect.

Earlier works that first proposed this model also suggested formulae for determining suitable values

of the parameters ρ and λ = (λ1, . . . , λp) in terms of the physical distances between SNPs and other

population genetics quantities.8 However, given that we have access to a large data set, we choose instead to estimate $\lambda$ by EM (Supplementary Methods). We have noticed that it works well to also explicitly prevent $\lambda$ from being too large or too small (e.g., $10^{-6} \le \lambda_j \le 10^{-3}$).
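The transition probabilities in (5) and the emission probabilities in (6) can be sketched in a few lines of plain Python. The function names `transition_matrix` and `emission_prob`, the list-based representation, and the toy distance `d_j = 0.01` are assumptions made for illustration; the clipping bounds are the example values quoted above.

```python
import math

def transition_matrix(alpha, rho, d_j):
    """Return Q with Q[k][k2] = P[Z_j = k2 | Z_{j-1} = k], following equation (5)."""
    b = math.exp(-rho * d_j)  # weight of staying on the same reference
    K = len(alpha)
    return [[(1.0 - b) * alpha[k2] + (b if k2 == k else 0.0) for k2 in range(K)]
            for k in range(K)]

def emission_prob(h, h_ref, lam_j, lo=1e-6, hi=1e-3):
    """Return f_j(h | k) from equation (6), where h_ref is the allele of reference k.

    lam_j is clipped to [lo, hi], mirroring the bounds suggested in the text.
    """
    lam = min(max(lam_j, lo), hi)
    return 1.0 - lam if h == h_ref else lam

# Uniform ancestry weights alpha_k = 1/K and rho = 1, as in the applications.
K = 4
Q = transition_matrix([1.0 / K] * K, rho=1.0, d_j=0.01)
```

Each row of `Q` sums to one because the weights alpha sum to one, which is a quick sanity check on any implementation of (5).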

To reduce the computational burden and mitigate the risk of overfitting, the value of K should

not be too large; here, we adopt K = 100. We have observed that larger values improve the

goodness-of-fit relatively little, while reducing power by increasing the similarity between variables and knockoffs.9 Having thus fixed $K$, the identities of the reference haplotypes for each $i$, $\{\sigma_{i1}, \ldots, \sigma_{iK}\}$, are chosen in a data-adaptive fashion as those whose ancestry is most likely to be similar to that of $H^{(i)}$. Concretely, one may carry this out as outlined by Algorithm 1, separately chromosome-by-chromosome. Different options are available for defining similarities between haplotypes; for simplicity, we use the Hamming distance. Since computing the pairwise distances between all haplotypes would have quadratic complexity in the sample size, which is unfeasible for large data, we first divide the haplotypes into clusters of size $N$, with $K \ll N \ll 2n$ (e.g., $N \approx 5000$), through recursive 2-means clustering, and then we only compute distances within clusters, following in the footsteps of SHAPEIT v3.37

Algorithm 1 Choosing the HMM reference haplotypes

Input: haplotypes $H \in \{0, 1\}^{2n \times p}$, parameter $K$;
Input: hyperparameters $N_1, N_2$, s.t. $K \ll N_1 < N_2 \ll n$;
Input: a distance measure $\xi$ between haplotypes.
Divide $\{1, \ldots, 2n\}$ into $M$ sets $C_c$ s.t. $N_1 \le |C_c| \le N_2$, by recursive 2-means clustering.37
for $c = 1, \ldots, M$ do
    Compute a distance matrix $D \in \mathbb{R}^{|C_c| \times |C_c|}$ for all haplotypes in $C_c$, with $\xi$.
    for $i$ in $C_c$ do
        Define $R^{(i)}$ as the set of $K$ nearest neighbors of $H^{(i)}$ in $C_c$.
Output: a set $R^{(i)}$ of $K$ references for each haplotype $H^{(i)}$.
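The inner loop of Algorithm 1 can be sketched as follows, assuming the clusters have already been formed (the recursive 2-means step is not shown). The function names, the tie-breaking by index, and the toy haplotypes are illustrative assumptions rather than the authors' implementation.

```python
def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def choose_references(haplotypes, cluster, K):
    """For each haplotype index in `cluster`, return its K nearest neighbors
    (by Hamming distance) among the other members of the same cluster."""
    refs = {}
    for i in cluster:
        others = [j for j in cluster if j != i]
        # Ties are broken by index so that the output is deterministic.
        others.sort(key=lambda j: (hamming(haplotypes[i], haplotypes[j]), j))
        refs[i] = others[:K]
    return refs

# Toy example: four haplotypes forming two visibly distinct pairs.
haplotypes = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
refs = choose_references(haplotypes, cluster=[0, 1, 2, 3], K=1)
```

The cost of this step is quadratic in the cluster size, which is why distances are only computed within clusters rather than across all haplotypes.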

While the approach in Algorithm 1 is the easiest to introduce first, in practice it is preferable for

our purpose to utilize a different set of local references in different parts of the same chromosome.

We will describe this extension later, after discussing the novel knockoff generation algorithm.

Generating knockoffs preserving population structure

Conditional on the reference indices, {σi1, . . . , σiK}, the above setup describes each haplotype

sequence H(i) as an HMM; a model for which we already know how to generate knockoffs in

theory.5,9 Therefore, generating knockoffs in our setting is straightforward: all we need to do is to

define the customized HMM for each haplotype sequence, and then to apply the knockoff generation

algorithm previously developed in KnockoffZoom v1.9 This solution is outlined in Algorithm 2.

Algorithm 2 Knockoff haplotypes preserving population structure

Input: haplotypes $H \in \{0, 1\}^{2n \times p}$, genetic map $d \in \mathbb{R}^{p-1}$, partition $\mathcal{G}$ of $\{1, \ldots, p\}$; parameter $K$.
for $i = 1, \ldots, 2n$ do
    Assign references $R^{(i)} = \{\sigma_{i1}, \ldots, \sigma_{iK}\}$ with Algorithm 1.
    Initialize $\alpha^{(i)}_k \leftarrow 1/K$, for each $k \in \{1, \ldots, K\}$.
Estimate $\lambda = (\lambda_1, \ldots, \lambda_p)$ by EM (Supplementary Methods), initialize $\rho \leftarrow 1$.
for $i = 1, \ldots, 2n$ do
    Define the HMM $\{R^{(i)}, \rho, \lambda\}$.
    Sample $Z^{(i)} = (Z^{(i)}_1, \ldots, Z^{(i)}_p)$ from $P[Z^{(i)} \mid H^{(i)}]$, with step I of Algorithm 3 in Sesia et al.9
    Sample a knockoff copy $\tilde{Z}^{(i)}$ of $Z^{(i)}$ with respect to $\mathcal{G}$, with step II of Algorithm 3 in Sesia et al.9
    Sample $\tilde{H}^{(i)}$ from $P[H^{(i)} \mid Z^{(i)} = \tilde{Z}^{(i)}]$, with step III of Algorithm 3 in Sesia et al.9
Output: knockoff haplotypes $\tilde{H} \in \{0, 1\}^{2n \times p}$.

Knockoffs with local reference motifs based on hold-out distances

Relatedness is not necessarily homogeneous across the genome. This is particularly evident in

the case of admixture, which may cause an individual to share haplotypes with a certain population

only in part of a chromosome. Therefore, it is worth extending Algorithms 1–2 to accommodate

different local references within the same chromosome. We implement this as follows.

First, we divide each chromosome into relatively wide genetic windows (10 Mb, for concreteness);

then, we choose the reference haplotypes separately within each of them, based on their similarities

outside the window of interest. In order to allow the choice of references to be as locally adaptive as

possible, we only consider the alleles in the two neighboring windows when computing distances.

This approach is similar to that of SHAPEIT v3,37 although the latter does not hold out the

SNPs in the current window to determine local similarity. Our approach is better suited for

knockoff generation because it reduces overfitting—knockoffs too similar to the original variables—

and consequently increases power. Once the local references for each haplotype have been defined,

we can apply Algorithm 2 window-by-window. To avoid discontinuities at the boundaries, we

consider overlapping windows (expanded by 10 Mb on each side). More precisely, we condition on

all SNPs within 10 Mb when sampling the latent Markov chain (step I of Algorithm 3 in Sesia et

al.9) but then we only generate knockoffs within the window of interest.
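The window bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: positions are given in Mb, windows are aligned to multiples of the width, and the function name `window_sets` is hypothetical. It returns the SNPs inside the window of interest and the held-out SNPs in the two neighboring windows, which are the ones used to compute distances between haplotypes.

```python
def window_sets(positions_mb, w, width=10.0):
    """Return (inside, holdout) SNP indices for the w-th window.

    inside:  SNPs in [w * width, (w + 1) * width), where knockoffs are generated.
    holdout: SNPs in the two neighboring windows, used to pick local references.
    """
    lo, hi = w * width, (w + 1) * width
    inside = [j for j, x in enumerate(positions_mb) if lo <= x < hi]
    holdout = [j for j, x in enumerate(positions_mb)
               if (lo - width <= x < lo) or (hi <= x < hi + width)]
    return inside, holdout

positions_mb = [1.0, 5.0, 12.0, 18.0, 25.0, 31.0, 45.0]
inside, holdout = window_sets(positions_mb, w=1)
```

Because the references for a window are chosen without looking at the SNPs inside it, the fitted mosaic is less likely to reproduce the original haplotype too closely within that window.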

Knockoffs preserving familial relatedness

Within an IBD-sharing family of size $m$, we model the joint distribution of the haplotype sequences, namely $(H^{(1)}, \ldots, H^{(m)})$, as an HMM with a $K^m$-dimensional latent Markov chain. For simplicity, we write this as $Z^{(1:m)} = (Z^{(1)}, \ldots, Z^{(m)})$, where $Z^{(i)} \in \{1, \ldots, K\}^p$. Conditional on $Z^{(i)}$, each element of $H^{(i)}$ is independent and follows the same emission distribution as in (4)–(6):

$$P[H^{(i)}_j = 1 \mid Z^{(i)}_j = k] = f^{(i)}_j(1 \mid k) = \begin{cases} 1 - \lambda_j, & \text{if } H^{(\sigma_{ik})}_j = 1, \\ \lambda_j, & \text{if } H^{(\sigma_{ik})}_j = 0. \end{cases} \tag{7}$$

If Z(1), . . . , Z(m) were also independent of each other, this would reduce to a collection of m HMMs

of the type in (4)–(6). However, to account for familial relatedness, we assume that different Z(i)

are coupled at the IBD segments (which we have previously detected and hold fixed), as described

below.

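Denote by $\partial(i, j) \subset \{1, \ldots, m\}$ the set of haplotype indices that share an IBD segment with $H^{(i)}$ at position $j$. Let us also define $\eta_{i,j} = 1/(1 + |\partial(i, j)|) \in (0, 1]$. Then, we model the distribution of $Z^{(1:m)}$ as follows. For $1 < j \le p$,

$$P\left[Z^{(1:m)}_j = z^{(1:m)}_j \mid Z^{(1:m)}_{j-1} = z^{(1:m)}_{j-1}\right] = \prod_{i=1}^{m} \left(Q^{(i)}_j(z^{(i)}_j \mid z^{(i)}_{j-1})\right)^{\eta_{i,j}} \prod_{i' \in \partial(i,j)} \mathbb{1}\left[z^{(i)}_j = z^{(i')}_j\right], \tag{8}$$

where the transition matrices $Q^{(i)}_j$ are defined as in (5), while

$$P\left[Z^{(1:m)}_1 = (k^{(1)}, \ldots, k^{(m)})\right] = \prod_{i=1}^{m} \left(\alpha^{(i)}_{k^{(i)}}\right)^{\eta_{i,1}} \prod_{i' \in \partial(i,1)} \mathbb{1}\left[k^{(i)} = k^{(i')}\right]. \tag{9}$$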

The first term on the right-hand-side of (8) describes the transitions in the Markov chain, while

the second term is the constraint that makes the haplotypes match along the IBD segments. The

purpose of the ηi,j exponent is to make the marginal distribution of each sequence as consistent

as possible with the model presented earlier for unrelated haplotypes. (If we set instead ηi,j = 1,

transitions of the latent state may occur with significantly different frequency inside and outside

IBD segments). In the trivial cases of families of size one, ∂(1, j) = ∅ and η1,j = 1, for all

j ∈ {1, . . . , p}, so it is easy to see that (8)–(9) reduce to the original model for unrelated haplotypes

in (5). By contrast, in the general case, the latent states for different haplotypes in the same family

are constrained to be identical along all IBD segments. See Supplementary Figure 36 (a) for a

graphical representation of this model.
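As a small illustration of the quantities entering (8)–(9), the following sketch computes the exponents $\eta_{i,j} = 1/(1 + |\partial(i,j)|)$ from a list of IBD segments. The segment format `(i, i2, start, end)` with inclusive, zero-based site indices and the function name `eta_exponents` are assumptions made for this example; the segments are taken as given, exactly as listed.

```python
def eta_exponents(ibd_segments, m, p):
    """Return eta[i][j] = 1 / (1 + number of haplotypes sharing IBD with i at site j).

    ibd_segments: list of (i, i2, start, end), with inclusive zero-based sites.
    """
    partners = [[set() for _ in range(p)] for _ in range(m)]
    for i, i2, start, end in ibd_segments:
        for j in range(start, end + 1):
            partners[i][j].add(i2)
            partners[i2][j].add(i)
    return [[1.0 / (1 + len(partners[i][j])) for j in range(p)] for i in range(m)]

# Family of three haplotypes over four sites; haplotypes 0 and 1 are IBD on sites 1-2.
eta = eta_exponents([(0, 1, 1, 2)], m=3, p=4)
```

Inside the shared segment the two coupled haplotypes each contribute a transition factor raised to the power 1/2, so that their product counts as a single transition, which matches the stated purpose of the exponent.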

Generating knockoffs with the existing algorithm for HMMs in this case would have computational complexity equal to $O(npK^m)$, which is unfeasible for large data sets unless $m = 1$. To

understand why this complexity is $O(npK^m)$, note that the model is effectively an HMM with a $K^m$-dimensional latent Markov chain,9 where each vector-valued variable corresponds to the alleles at a specific site for all individuals in the family. Fortunately, one can equivalently look at the joint distribution of $(Z^{(1:m)}, H^{(1:m)})$ as a more general Markov random field68 with $2 \times m \times p$ variables, each taking values in $\{1, \ldots, K\}$ or $\{0, 1\}$, respectively. See Supplementary Figure 36 (b) for a graphical representation of this model, where each node corresponds to one of the two haplotypes of an individual at a particular marker. The random field perspective opens the door to

more efficient inference and posterior sampling based on message-passing algorithms.69 Leaving the

technical details to the Supplementary Methods, we outline the new knockoff generation procedure

in Algorithm 3.

In a nutshell, we follow the same spirit of the construction for unrelated haplotypes in Algorithm 2, with the important difference that the HMM with a $K$-dimensional latent Markov chain of length $p$ is replaced by a latent Markov random field with $2 \times m \times p$ variables, which requires some innovations.

• First, the $K$ haplotype references in the model for each $H^{(i)}$ are shared by all haplotypes within the same family; see Supplementary Algorithm 1.

• Second, the posterior sampling of $Z^{(1:m)} \mid H^{(1:m)}$ is carried out via generalized belief propagation69 instead of the standard forward-backward procedure for HMMs;5,9 see Supplementary Algorithm 2 and Supplementary Figure 37. (This generally involves some degree of approximation, but it is exact for many family structures).

• Third, the knockoff copies $\tilde{Z}^{(1:m)}$ of $Z^{(1:m)}$ are generated with a modified version of the existing algorithm for HMMs9 that circumvents the computational difficulties of the higher-dimensional model by breaking the couplings between different haplotypes through conditioning70 upon the extremities of the IBD segments; see Supplementary Algorithm 3 and Supplementary Figure 38. To clarify, conditioning on the extremities of the IBD segments means that we make the knockoffs identical to the true genotypes for a few sites in each family, which reduces power only slightly (we consider relatively long IBD segments, so there are few extremities), but greatly reduces the computational burden (see Supplement for a full explanation).

It is worth remarking that, for trivial families of size one, this is exactly equivalent to Algorithm 2.

Finally, note that the extension to local references with hold-out distances discussed earlier also

applies seamlessly here. Even though our software implements this extension, we do not explicitly

write down the algorithms with local references in this paper for the sake of clarity and space.

Algorithm 3 Knockoff haplotypes preserving population structure and familial relatedness

Input: $H \in \{0, 1\}^{2n \times p}$, $d \in \mathbb{R}^{p-1}$, $\mathcal{G}$, and $K$ as in Algorithm 2, collection of IBD segments $\mathcal{I}$.
Define the set $U \subseteq \{1, \ldots, n\}$ of haplotype indices that do not share any IBD segments in $\mathcal{I}$.
Divide the remaining haplotypes into $L$ distinct families $F_f \subseteq \{1, \ldots, n\}$, for $f \in \{1, \ldots, L\}$.
for $f \in \{1, \ldots, L\}$ do
    Assign references $R^{(f)} = \{\sigma_{f1}, \ldots, \sigma_{fK}\}$ with Supplementary Algorithm 1.
    Initialize $\alpha^{(f)}_k \leftarrow 1/K$, for each $k \in \{1, \ldots, K\}$.
Estimate $\lambda = (\lambda_1, \ldots, \lambda_p)$ by EM (Supplementary Methods), initialize $\rho \leftarrow 1$.
for $f \in \{1, \ldots, L\}$ do
    Define the HMM $\{R^{(f)}, \rho, \lambda\}$.
    Sample $(Z^{(i)})_{i \in F_f} \in \{1, \ldots, K\}^{m \times p}$ from $P[(Z^{(i)})_{i \in F_f} \mid (H^{(i)})_{i \in F_f}]$, with Supplementary Algorithm 2.
    Sample a knockoff copy $(\tilde{Z}^{(i)})_{i \in F_f}$ of $(Z^{(i)})_{i \in F_f}$ with respect to $\mathcal{G}$, with Supplementary Algorithm 3.
    Sample $\tilde{H}^{(i)}$ from $P[H^{(i)} \mid Z^{(i)} = \tilde{Z}^{(i)}]$, coordinate by coordinate independently, for all $i \in F_f$.
for $i \in U$ do
    Generate a knockoff copy $\tilde{H}^{(i)}$ of $H^{(i)}$ as in Algorithm 2.
Output: knockoff haplotype matrix $\tilde{H} \in \{0, 1\}^{2n \times p}$ that preserves the IBD segments.

Quality control and data pre-processing for the UK Biobank

We begin with 487,297 genotyped and phased subjects in the UK Biobank (application 27837).

Among these, 147,669 have reported at least one close relative in the data set. We define families by

clustering together individuals with kinship greater than or equal to 0.05; then we discard 322 individuals (chosen as those with the most missing phenotypes) to ensure that no families have size larger than 10. This leaves us with 136,818 individuals labeled as related, divided into 57,164 families. The median family size is 2, and the mean is 2.4. The total number of individuals that passed this quality control (including those who have no relatives) is 486,975. We only analyze biallelic SNPs with minor allele frequency above 0.1% and in Hardy-Weinberg equilibrium ($10^{-6}$), among the subset of 350,119 unrelated British individuals analyzed in previous work.9 The final SNP count is 591,513.
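One simple way to turn pairwise kinship estimates into families is to take connected components of the graph whose edges are pairs with kinship at or above the threshold. The sketch below does this with a union-find structure; treating "clustering together" as connected components is an assumption of this example, as are the function name `cluster_families` and the input format of `(i, j, kinship)` triples.

```python
def cluster_families(n, kinship_pairs, threshold=0.05):
    """Group individuals 0..n-1 into families via connected components.

    kinship_pairs: iterable of (i, j, kinship) for pairs with an estimated kinship.
    Returns only the groups with at least two members.
    """
    parent = list(range(n))

    def find(a):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, phi in kinship_pairs:
        if phi >= threshold:
            parent[find(i)] = find(j)

    families = {}
    for i in range(n):
        families.setdefault(find(i), []).append(i)
    return [members for members in families.values() if len(members) > 1]

pairs = [(0, 1, 0.25), (1, 2, 0.06), (3, 4, 0.01)]
families = cluster_families(5, pairs)
```

With this definition two individuals can land in the same family without being directly related to each other, which is one reason a cap on the family size (10 in the text) may need to be enforced afterwards.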

Including additional covariates

We account for the sex, age and squared age of the subjects to increase power (squared age is

not used for height), as in earlier work.9,40 These covariates are leveraged in the KnockoffZoom

predictive model, along with the top 5 genetic principal components. The inclusion of the principal

components, which may capture population-wide shifts in the distribution of the phenotype, is

useful to improve power because it increases the accuracy of the predictive model while keeping it

sparse,9 but it is unnecessary to avoid spurious discoveries due to population structure, since our

method already accounts for that through the knockoffs. We thus fit a sparse regression model to

predict $Y$ given $[Z, X, \tilde{X}] \in \mathbb{R}^{n \times (r+2p)}$, where $Z, X, \tilde{X}$ are the $r$ covariates, the $p$ genotypes, and

the p knockoffs, respectively. The model coefficients for Z are not regularized, and their estimates

are not directly included in the test statistics.

Numerical experiments

Feature importance measures for each SNP are computed in three alternative ways: by fitting the

lasso with cross-validation and taking the absolute value of the estimated regression coefficients;9

by running BOLT-LMM40 and taking the negative logarithm of the marginal p-values; and by

performing univariate logistic regression (in the case of binary phenotypes) and taking the negative

logarithm of the marginal p-values. These models are designed to predict $Y$ given $[X, \tilde{X}]$; in the first two cases we also include the top 10 principal components of the genotype matrix (computed on the entire UK Biobank data set) as additional covariates. Then, the feature importance measures $T_j$ and $\tilde{T}_j$, for the $j$th SNP and its corresponding knockoff, are combined in the usual way to define the knockoff test statistics for each group $G_g \subseteq \{1, \ldots, p\}$ of variables:

$$W_g = \sum_{j \in G_g} T_j - \sum_{j \in G_g} \tilde{T}_j.$$
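The group statistics $W_g$ are straightforward to compute once the importance measures are available. The sketch below assumes they are stored in two lists of equal length, one for the SNPs and one for the knockoffs; the function name and the toy numbers are illustrative only.

```python
def knockoff_statistics(T, T_knock, groups):
    """Return W_g = sum_{j in G_g} T_j - sum_{j in G_g} T_knock_j for each group."""
    return [sum(T[j] for j in g) - sum(T_knock[j] for j in g) for g in groups]

T = [3.0, 1.0, 0.0, 2.0]        # importance of the original SNPs
T_knock = [0.5, 0.5, 1.0, 2.0]  # importance of the corresponding knockoffs
groups = [[0, 1], [2], [3]]
W = knockoff_statistics(T, T_knock, groups)
```

A large positive $W_g$ indicates that the original variants in group $g$ are more important than their knockoffs, while values near zero or negative provide no evidence of association.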

In the simulations with related samples, the latent Gaussian variable for the ith individual in

the probit model is given by:

$$L^{(i)} = \sum_{j=1}^{p} \beta_j X^{(i)}_j + \gamma E^{(f_i)} + \sqrt{1 - \gamma^2}\, \varepsilon^{(i)},$$

where $E^{(f_i)}$ and $\varepsilon^{(i)}$ are independent standard normal random variables, $\gamma \in [0, 1]$, and $f_i$ denotes the family to which the $i$th individual belongs. The magnitude of the nonzero genetic coefficients

β is varied as a parameter, to control the total heritability of the trait.
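The latent variable above can be simulated with the standard library alone. In the sketch below, the function name `simulate_latent`, the list-based genotype matrix, and the toy inputs are assumptions for illustration; the step that converts the latent variable into a binary phenotype is not shown.

```python
import math
import random

def simulate_latent(X, beta, family, gamma, rng):
    """Draw L_i = sum_j beta_j X_ij + gamma * E_{f_i} + sqrt(1 - gamma^2) * eps_i.

    family[i] is the family label of individual i; one shared effect E is drawn
    per family, and one independent noise term eps per individual.
    """
    E = {f: rng.gauss(0.0, 1.0) for f in sorted(set(family))}
    latent = []
    for i, x in enumerate(X):
        genetic = sum(b * xj for b, xj in zip(beta, x))
        noise = rng.gauss(0.0, 1.0)
        latent.append(genetic + gamma * E[family[i]] + math.sqrt(1.0 - gamma ** 2) * noise)
    return latent

X = [[0, 1], [0, 1], [2, 0]]
beta = [0.5, 0.0]
family = [0, 0, 1]
L_shared = simulate_latent(X, beta, family, gamma=1.0, rng=random.Random(0))
```

With `gamma = 1` the individual noise vanishes, so relatives with identical genetic components receive identical latent values; with `gamma = 0` the family effect disappears entirely.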

Supplementary Information for

Controlling the false discovery rate in GWAS with population structure

Matteo Sesia, Stephen Bates, Emmanuel Candes, Jonathan Marchini, Chiara Sabatti

SUPPLEMENTARY METHODS

A. Estimating model parameters by EM

We can estimate the HMM parameters θ = (α, λ, ρ) in (5)–(6), in the main paper, with an

expectation-maximization (EM) method. To write down the algorithm explicitly, we begin by

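noting the log-likelihood of $\theta$ given both the observable, $H$, and latent, $Z$, variables is:

$$\begin{aligned}
\ell(\theta; H, Z) &= \log p(H, Z \mid \theta) \\
&= \sum_{i=1}^{n} \log p(H^{(i)}, Z^{(i)} \mid \theta) \\
&= \sum_{i=1}^{n} \log \left[ \prod_{j=1}^{p} Q_j(Z^{(i)}_j \mid Z^{(i)}_{j-1}) \cdot \prod_{j=1}^{p} f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j) \right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{p} \log Q_j(Z^{(i)}_j \mid Z^{(i)}_{j-1}) + \sum_{i=1}^{n} \sum_{j=1}^{p} \log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j).
\end{aligned}$$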

This log-likelihood cannot be directly maximized because we cannot observe $Z$. Instead, given an initial estimate of the model parameters, $\theta^{(t-1)}$, we iteratively update $\theta^{(t)}$ by maximizing

$$\begin{aligned}
L(\theta, \theta^{(t-1)}) &= E_Z\left[\ell(\theta; H, Z) \mid H, \theta^{(t-1)}\right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{p} E_Z\left[\log Q_j(Z^{(i)}_j \mid Z^{(i)}_{j-1}) \mid H^{(i)}, \theta^{(t-1)}\right] + \sum_{i=1}^{n} \sum_{j=1}^{p} E_Z\left[\log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j) \mid H^{(i)}, \theta^{(t-1)}\right].
\end{aligned} \tag{1}$$

This quantity can be computed and maximized efficiently by leveraging the Markov property, as in the Baum-Welch algorithm.

Let us begin by defining, for any fixed j ∈ {1, . . . , p}, the posterior marginals

$$\gamma^{(i)}_j(k) = P\left[Z^{(i)}_j = k \mid H^{(i)}, \theta^{(t-1)}\right].$$

It is well-known that these quantities can be computed efficiently with the classical forward-

backward iteration that defines the expectation (E) step of the EM algorithm. What remains

to be developed explicitly is the maximization (M) step of the EM algorithm; we will do this in

the following, separately for α, λ, and ρ. These are fairly standard calculations but we outline the

details here for completeness.
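For reference, the E step for a single haplotype can be sketched with a textbook forward-backward recursion. This is a generic implementation under stated assumptions, not the authors' code: the inputs are plain nested lists, `trans[j][l][k]` is the probability of moving from state `l` at site `j` to state `k` at site `j + 1` (zero-based), and `emis[j][k]` is the emission probability of the observed allele at site `j` given state `k`.

```python
def forward_backward(init, trans, emis):
    """Return gamma[j][k] = P[Z_j = k | H] for a discrete HMM.

    init[k]: initial state probabilities; trans and emis as described above.
    Forward and backward weights are renormalized at every site for stability.
    """
    p, K = len(emis), len(init)
    F = [[0.0] * K for _ in range(p)]
    B = [[1.0] * K for _ in range(p)]

    # Forward pass: F[j][k] is proportional to P[Z_j = k, H_1..H_j].
    F[0] = [init[k] * emis[0][k] for k in range(K)]
    s = sum(F[0])
    F[0] = [v / s for v in F[0]]
    for j in range(1, p):
        for k in range(K):
            F[j][k] = emis[j][k] * sum(F[j - 1][l] * trans[j - 1][l][k] for l in range(K))
        s = sum(F[j])
        F[j] = [v / s for v in F[j]]

    # Backward pass: B[j][l] is proportional to P[H_{j+1}..H_p | Z_j = l].
    for j in range(p - 2, -1, -1):
        for l in range(K):
            B[j][l] = sum(trans[j][l][k] * emis[j + 1][k] * B[j + 1][k] for k in range(K))
        s = sum(B[j])
        B[j] = [v / s for v in B[j]]

    gamma = []
    for j in range(p):
        g = [F[j][k] * B[j][k] for k in range(K)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma

# Toy chain with K = 2 states and p = 3 sites.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]] * 2
emis = [[0.99, 0.01], [0.99, 0.01], [0.01, 0.99]]
gamma = forward_backward(init, trans, emis)
```

The cost is of order $pK^2$ per haplotype for a generic transition matrix; the special structure of (5) allows this to be reduced, but that optimization is omitted here for clarity.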

1. Estimating the site-specific mutation rate λj

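For any fixed $j \in \{1, \ldots, p\}$, the parameter $\lambda_j$ appears in the second term of (1). Writing $R^{(i)}_j(k)$ for the allele at site $j$ of the $k$-th reference of haplotype $i$, and $\delta$ for the Kronecker delta, we have:

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} E_Z\left[\log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j) \mid H^{(i)}, \theta^{(t-1)}\right]
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{z} \log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j = z_j)\, P\left[Z^{(i)} = z \mid H^{(i)}, \theta^{(t-1)}\right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j = k)\, P\left[Z^{(i)}_j = k \mid H^{(i)}, \theta^{(t-1)}\right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \log f^{(i)}_j(H^{(i)}_j \mid Z^{(i)}_j = k)\, \gamma^{(i)}_j(k) \\
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \log\left[(1 - \lambda_j)\, \delta_{H^{(i)}_j, R^{(i)}_j(k)} + \lambda_j \left(1 - \delta_{H^{(i)}_j, R^{(i)}_j(k)}\right)\right] \gamma^{(i)}_j(k) \\
&= \log(1 - \lambda_j)\, \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \delta_{H^{(i)}_j, R^{(i)}_j(k)}\, \gamma^{(i)}_j(k) + \log(\lambda_j)\, \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \left(1 - \delta_{H^{(i)}_j, R^{(i)}_j(k)}\right) \gamma^{(i)}_j(k) \\
&= \log(1 - \lambda_j)(1 - \Gamma_j) + \log(\lambda_j)\, \Gamma_j,
\end{aligned}$$

where we have defined:

$$\Gamma_j = \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \left(1 - \delta_{H^{(i)}_j, R^{(i)}_j(k)}\right) \gamma^{(i)}_j(k).$$

The last step uses $\sum_k \gamma^{(i)}_j(k) = 1$. The above is maximized at $\lambda_j = \Gamma_j$. Therefore, the update rule for $\lambda_j$ in the M step is: $\lambda_j \leftarrow \Gamma_j$.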
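The update for $\lambda_j$ amounts to averaging the posterior probability of a mismatch between each haplotype and its references. The sketch below computes $\Gamma_j$ for one site and also returns a clipped value, using the example bounds from the main text; the function names, argument layout, and toy posteriors are assumptions for illustration.

```python
import math

def m_step_lambda(h_col, ref_col, gamma_col, lo=1e-6, hi=1e-3):
    """Return (Gamma_j, clipped lambda_j) for a single site j.

    h_col[i]:        observed allele of haplotype i at site j
    ref_col[i][k]:   allele of the k-th reference of haplotype i at site j
    gamma_col[i][k]: posterior probability that haplotype i copies reference k
    """
    n = len(h_col)
    Gamma = sum(g
                for i in range(n)
                for k, g in enumerate(gamma_col[i])
                if ref_col[i][k] != h_col[i]) / n
    return Gamma, min(max(Gamma, lo), hi)

def objective(lam, Gamma):
    """Expected log-likelihood term log(1 - lam)(1 - Gamma) + log(lam) Gamma."""
    return math.log(1.0 - lam) * (1.0 - Gamma) + math.log(lam) * Gamma

h_col = [1, 0]
ref_col = [[1, 0], [0, 0]]
gamma_col = [[0.75, 0.25], [0.5, 0.5]]
Gamma, lam_clipped = m_step_lambda(h_col, ref_col, gamma_col)
```

In this toy example the unclipped update would be far larger than the upper bound, so the clipped value is what a bounded implementation would carry forward to the next iteration.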

2. Estimating the recombination scale

The parameter ρ appears in the first term of (1) through:

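$$\begin{aligned}
E_Z\left[\log Q_j(Z^{(i)}_j \mid Z^{(i)}_{j-1}) \mid H^{(i)}, \theta^{(t-1)}\right]
&= \sum_{z} \log Q_j(z_j \mid z_{j-1})\, P\left[Z^{(i)} = z \mid H^{(i)}, \theta^{(t-1)}\right] \\
&= \sum_{k,l} \log Q_j(k \mid l) \sum_{z_{-(j,j-1)}} P\left[Z^{(i)} = (k, l, z_{-(j,j-1)}) \mid H^{(i)}, \theta^{(t-1)}\right] \\
&= \sum_{k,l} \log Q_j(k \mid l)\, P\left[Z^{(i)}_j = k, Z^{(i)}_{j-1} = l \mid H^{(i)}, \theta^{(t-1)}\right].
\end{aligned}$$

By defining

$$\xi^{(i)}_j(k, l) = P\left[Z^{(i)}_j = k, Z^{(i)}_{j-1} = l \mid H^{(i)}, \theta^{(t-1)}\right],$$

we can write

$$\sum_{i=1}^{n} \sum_{j=1}^{p} E_Z\left[\log Q_j(Z^{(i)}_j \mid Z^{(i)}_{j-1}) \mid H^{(i)}, \theta^{(t-1)}\right] = \sum_{i=1}^{n} \sum_{j=1}^{p} \sum_{k,l} \log Q_j(k \mid l)\, \xi^{(i)}_j(k, l).$$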

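We will discuss later how to compute $\xi$. Now, assume that $\xi$ is available and we want to optimize the above quantity with respect to the parameter $\rho$, which is hidden inside the transition matrices $Q$. For simplicity, we also assume that $\alpha^{(i)}_k = 1/K$, $\forall i, k$ (we omit the computations for the general case, which are more complicated). Note that

$$\begin{aligned}
\log Q_j(k \mid l) &= \log\left(\frac{1 - b_j}{K} + b_j \delta_{k,l}\right) \\
&= \log\left(\frac{1 - b_j}{K}\right) + \left[\log\left(\frac{1 - b_j}{K} + b_j\right) - \log\left(\frac{1 - b_j}{K}\right)\right] \delta_{k,l} \\
&= \text{const.} + \log(1 - b_j) + \left[\log(1 + (K-1) b_j) - \log(1 - b_j)\right] \delta_{k,l},
\end{aligned}$$

where $b_j = b_j(\rho) = e^{-\rho d_j}$. Therefore, up to an additive constant that does not depend on $\rho$, and using $\sum_{k,l} \xi^{(i)}_j(k, l) = 1$,

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \sum_{k,l} \log Q_j(k \mid l)\, \xi^{(i)}_j(k, l)
&= \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \log(1 - b_j) \sum_{k,l} \xi^{(i)}_j(k, l) + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left[\log(1 + (K-1) b_j) - \log(1 - b_j)\right] \sum_{k} \xi^{(i)}_j(k, k) \\
&= \sum_{j=1}^{p} \log(1 - b_j) + \sum_{j=1}^{p} \left[\log(1 + (K-1) b_j) - \log(1 - b_j)\right] \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \xi^{(i)}_j(k, k) \\
&= \sum_{j=1}^{p} \log(1 - b_j) + \sum_{j=1}^{p} \left[\log(1 + (K-1) b_j) - \log(1 - b_j)\right] \Xi_j,
\end{aligned}$$

where we have defined:

$$\Xi_j = \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \xi^{(i)}_j(k, k).$$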

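It is easy to verify that the above function is strictly quasiconcave in $\rho$, so it can be optimized numerically by setting its first derivative equal to zero. We will include the details of our procedure later for completeness. Meanwhile, note that $\xi^{(i)}_j(k, l)$ can be easily obtained, up to normalization, from the forward-backward weights computed in the E step. Denoting the unnormalized weights by $\tilde{\xi}^{(i)}_j(k, l)$:

$$\begin{aligned}
\xi^{(i)}_j(k, l) &= P\left[Z^{(i)}_{j-1} = l, Z^{(i)}_j = k \mid H^{(i)}\right] \\
&\propto P\left[Z^{(i)}_{j-1} = l, Z^{(i)}_j = k, H^{(i)}\right] \\
&\propto F^{(i)}_{j-1}(l)\, Q^{(i)}_j(k \mid l)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) = \tilde{\xi}^{(i)}_j(k, l),
\end{aligned}$$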

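where $F$ and $B$ denote the forward and backward weights in the forward-backward algorithm. Under the assumption $\alpha^{(i)}_k = 1/K$, the transition matrix is $Q^{(i)}_j(k \mid l) = a_j + b_j \delta_{k,l}$ with $a_j = (1 - b_j)/K$. The normalization constant for $\xi^{(i)}_j(k, l)$, i.e., the sum of the unnormalized weights $\tilde{\xi}^{(i)}_j(k, l) = F^{(i)}_{j-1}(l)\, Q^{(i)}_j(k \mid l)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k)$, is:

$$\begin{aligned}
\sum_{k} \sum_{l} \tilde{\xi}^{(i)}_j(k, l) &= \sum_{k} \sum_{l} F^{(i)}_{j-1}(l)\, Q^{(i)}_j(k \mid l)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= \sum_{l} F^{(i)}_{j-1}(l) \sum_{k} \left[a_j + b_j \delta_{k,l}\right] f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= a_j \left(\sum_{l} F^{(i)}_{j-1}(l)\right) \sum_{k} f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) + b_j \sum_{k} F^{(i)}_{j-1}(k)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= a_j \sum_{k} f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) + b_j \sum_{k} F^{(i)}_{j-1}(k)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= \sum_{k} f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \left[a_j + b_j F^{(i)}_{j-1}(k)\right],
\end{aligned}$$

where the fourth line uses the fact that the forward weights are normalized, $\sum_l F^{(i)}_{j-1}(l) = 1$. The diagonal elements of $\xi$ are proportional to:

$$\begin{aligned}
\tilde{\xi}^{(i)}_j(k, k) &= F^{(i)}_{j-1}(k)\, Q^{(i)}_j(k \mid k)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= F^{(i)}_{j-1}(k) \left[a_j + b_j\right] f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k).
\end{aligned}$$

Recall that we care about

$$\Xi_j = \frac{1}{n} \sum_{i=1}^{n} \sum_{k} \xi^{(i)}_j(k, k) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sum_{k} \sum_{l} \tilde{\xi}^{(i)}_j(k, l)} \sum_{k} \tilde{\xi}^{(i)}_j(k, k),$$

which we can compute starting from

$$\begin{aligned}
\sum_{k} \tilde{\xi}^{(i)}_j(k, k) &= \sum_{k} F^{(i)}_{j-1}(k) \left(a_j + b_j\right) f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k) \\
&= (a_j + b_j) \sum_{k} F^{(i)}_{j-1}(k)\, f^{(i)}_j(H^{(i)}_j \mid k)\, B^{(i)}_j(k).
\end{aligned}$$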

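Going back to the details of optimizing

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \sum_{k,l} \log Q_j(k \mid l)\, \xi^{(i)}_j(k, l),$$

note that differentiating with respect to $\rho$ yields:

$$0 = -\sum_{j=1}^{p} \frac{b'_j}{1 - b_j} + \sum_{j=1}^{p} b'_j \left[\frac{K - 1}{1 + (K-1) b_j} + \frac{1}{1 - b_j}\right] \Xi_j.$$

By definition of $b_j(\rho) = e^{-\rho d_j}$, it follows that $b'_j = -d_j b_j$. Therefore,

$$\sum_{j=1}^{p} \frac{d_j b_j}{1 - b_j} = \sum_{j=1}^{p} d_j b_j \left[\frac{K - 1}{1 + (K-1) b_j} + \frac{1}{1 - b_j}\right] \Xi_j = \Psi(\rho),$$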

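where we have defined:

$$\Psi(\rho) = \sum_{j=1}^{p} d_j b_j(\rho) \left[\frac{K - 1}{1 + (K-1) b_j(\rho)} + \frac{1}{1 - b_j(\rho)}\right] \Xi_j.$$

Define also $\bar{d} = \frac{1}{p} \sum_{j=1}^{p} d_j$. Then, we want to solve

$$\Psi(\rho) = \sum_{j=1}^{p} \frac{d_j b_j}{1 - b_j} = e^{-\rho \bar{d}} \sum_{j=1}^{p} \frac{d_j}{1 - b_j}\, e^{-\rho (d_j - \bar{d})} = e^{-\rho \bar{d}}\, \Phi(\rho),$$

where

$$\Phi(\rho) = \sum_{j=1}^{p} \frac{d_j}{1 - b_j}\, e^{-\rho (d_j - \bar{d})}.$$

Therefore, we can solve iteratively for $\rho^*$:

$$\rho^* = -\frac{1}{\bar{d}} \log\left(\frac{\Psi(\rho^*)}{\Phi(\rho^*)}\right).$$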

Upon convergence (which we observe empirically but do not prove), the solution ρ∗ will give us

the M update for ρ in the EM algorithm: ρ← ρ∗.

3. Estimating the motif prevalences

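For any fixed $j \in \{1, \ldots, p\}$, the parameter $\alpha^{(i)}_k$ appears in the first term of (1) through:

$$\begin{aligned}
\log Q_j(k \mid l) &= \log\left((1 - b_j)\, \alpha^{(i)}_k + b_j \delta_{k,l}\right) \\
&= \log\left((1 - b_j)\, \alpha^{(i)}_k\right) + \left[\log\left((1 - b_j)\, \alpha^{(i)}_k + b_j\right) - \log\left((1 - b_j)\, \alpha^{(i)}_k\right)\right] \delta_{k,l} \\
&= (1 - \delta_{k,l}) \log \alpha^{(i)}_k + \log\left[(1 - b_j)\, \alpha^{(i)}_k + b_j\right] \delta_{k,l} + \text{const.},
\end{aligned}$$

where the constant does not depend on $\alpha^{(i)}_k$. Therefore, up to an additive constant,

$$\sum_{j=1}^{p} \sum_{k,l} \log Q_j(k \mid l)\, \xi^{(i)}_j(k, l) = \sum_{k} \log(\alpha^{(i)}_k) \sum_{j=1}^{p} \sum_{l} \xi^{(i)}_j(k, l) - \sum_{k} \log(\alpha^{(i)}_k) \sum_{j=1}^{p} \xi^{(i)}_j(k, k) + \sum_{j=1}^{p} \sum_{k} \log\left[(1 - b_j)\, \alpha^{(i)}_k + b_j\right] \xi^{(i)}_j(k, k).$$

Differentiating this with respect to $\alpha^{(i)}_k$ gives:

$$\begin{aligned}
0 &= \frac{1}{\alpha^{(i)}_k} \sum_{j=1}^{p} \sum_{l} \xi^{(i)}_j(k, l) - \frac{1}{\alpha^{(i)}_k} \sum_{j=1}^{p} \xi^{(i)}_j(k, k) + \sum_{j=1}^{p} \frac{1 - b_j}{(1 - b_j)\, \alpha^{(i)}_k + b_j}\, \xi^{(i)}_j(k, k) \\
&= \frac{\eta(k) - \eta}{\alpha^{(i)}_k} + \sum_{j=1}^{p} \frac{1 - b_j}{(1 - b_j)\, \alpha^{(i)}_k + b_j}\, \xi^{(i)}_j(k, k),
\end{aligned}$$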

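where

$$\eta(k) = \sum_{j=1}^{p} \sum_{l} \xi^{(i)}_j(k, l), \qquad \eta = \sum_{j=1}^{p} \xi^{(i)}_j(k, k).$$

In order to impose the constraint $\sum_k \alpha^{(i)}_k = 1$, we add a Lagrange multiplier $W$:

$$\begin{aligned}
0 &= -W + \frac{\eta(k) - \eta}{\alpha^{(i)}_k} + \sum_{j=1}^{p} \frac{1 - b_j}{(1 - b_j)\, \alpha^{(i)}_k + b_j}\, \xi^{(i)}_j(k, k) \\
&= -W \alpha^{(i)}_k + (\eta(k) - \eta) + \alpha^{(i)}_k \sum_{j=1}^{p} \frac{1 - b_j}{(1 - b_j)\, \alpha^{(i)}_k + b_j}\, \xi^{(i)}_j(k, k).
\end{aligned}$$

Therefore,

$$\alpha^{(i)}_k = \frac{1}{W} \left[\eta(k) - \eta + \alpha^{(i)}_k \sum_{j=1}^{p} \frac{1 - b_j}{(1 - b_j)\, \alpha^{(i)}_k + b_j}\, \xi^{(i)}_j(k, k)\right].$$

We can solve this iteratively, setting $W = \sum_k \alpha^{(i)}_k$ after each update of $\alpha^{(i)}$. Upon convergence (which we observe empirically but do not prove), the solution $\alpha^{(i)*}$ will then give the M update in the EM algorithm: $\alpha^{(i)} \leftarrow \alpha^{(i)*}$.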

B. Knockoffs preserving familial relatedness

1. Choosing the haplotype references

Supplementary Algorithm 1 modifies Algorithm 1 to ensure that: (i) IBD-sharing haplotypes

are not used as references for one another; (ii) all haplotypes in the same IBD-sharing family have

the same references.

Supplementary Algorithm 1 Choosing reference haplotypes preserving familial constraints

Input: $H \in \{0, 1\}^{2n \times p}$, $K$, and $N_1, N_2$ as in Algorithm 1;
Input: a collection of IBD-sharing families $F_1, \ldots, F_L$, a distance measure $\xi$ between haplotypes.
Divide the haplotypes into $M$ sets $C_c$ using $\xi$ as in Algorithm 1, preserving the family structure.
for $c = 1, \ldots, M$ do
    Compute a distance matrix $D \in \mathbb{R}^{|C_c| \times |C_c|}$ for all haplotypes in $C_c$.
    for $i$ in $C_c$ do
        if $\exists\, l$ such that $i \in F_l$ then
            Define $R^{(i)}$ as the set of $K$ nearest neighbors of $H^{(i)}$ in $C_c \setminus F_l$.
        else
            Define $R^{(i)}$ as the set of $K$ nearest neighbors of $H^{(i)}$ in $C_c$.
for $l$ in $1, \ldots, L$ do
    Initialize $R^{(l)} = \cap_{i \in F_l} R^{(i)}$.
    for $i \in F_l$ do
        Update $R^{(i)} = R^{(i)} \setminus R^{(l)}$.
        if $|R^{(l)}| < K$ then
            Update $R^{(l)} = R^{(l)} \cup R^{(i)}$.
        else
            break.
    for $i \in F_l$ do
        Set $R^{(i)} = R^{(l)}$.
Output: a set $R^{(i)}$ of $K$ references for each haplotype $H^{(i)}$.

2. Posterior sampling via belief propagation

Conditional on $H^{(1:m)}$, the distribution of $Z^{(1:m)}$ is a Markov random field with $m \times p$ variables, characterized by Equations (7)–(9) in the main paper. In order to sample $Z^{(1:m)} \mid H^{(1:m)}$, we implement belief propagation1 (BP) as follows. For any $i \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, p-1\}$, denote by $\mu_{(i,j) \to (i,j+1)} \in \mathbb{R}^K$ the forward message from $Z^{(i)}_j$ to $Z^{(i)}_{j+1}$. It is easy to verify that this must satisfy the following recursive definition:

$$\mu_{(i,j) \to (i,j+1)}(k) = \sum_{l=1}^{K} \left[Q^{(i)}_{j+1}(k \mid l)\right]^{\eta_{i,j+1}} \cdot f^{(i)}_j(H^{(i)}_j \mid l) \cdot \mu_{(i,j-1) \to (i,j)}(l) \cdot \prod_{i' \in \partial(i,j)} \mu_{(i',j) \to (i,j)}(l),$$

where it is understood that $\mu_{(i,0) \to (i,1)}(k) = 1$, for all $i$ and $k$. Above, $\mu_{(i',j) \to (i,j)}$ indicates the vertical message from $Z^{(i')}_j$ to $Z^{(i)}_j$, for any $i' \in \partial(i, j)$. By the BP rules, this satisfies:

$$\mu_{(i',j) \to (i,j)}(k) = \sum_{l=1}^{K} \delta_{k,l} \cdot \mu_{(i',j-1) \to (i',j)}(l) \cdot \mu_{(i',j+1) \to (i',j)}(l) \cdot \prod_{i'' \in \partial(i',j) \setminus \{i\}} \mu_{(i'',j) \to (i',j)}(l),$$

where $\delta_{k,l} = 1$ if $k = l$ and 0 otherwise. Above, $\mu_{(i',j+1) \to (i',j)}(l)$ indicates the backward message from $Z^{(i')}_{j+1}$ to $Z^{(i')}_j$, which is defined recursively as:

$$\mu_{(i,j) \to (i,j-1)}(k) = \sum_{l=1}^{K} \left[Q^{(i)}_j(l \mid k)\right]^{\eta_{i,j}} \cdot f^{(i)}_j(H^{(i)}_j \mid l) \cdot \mu_{(i,j+1) \to (i,j)}(l) \cdot \prod_{i' \in \partial(i,j)} \mu_{(i',j) \to (i,j)}(l).$$

Again, it is understood that $\mu_{(i,p+1) \to (i,p)}(k) = 1$, for all $i$ and $k$. Combined, the above message update rules define a BP algorithm that is in principle already applicable to approximately sample $Z^{(1:m)} \mid H^{(1:m)}$. However, these recursion relations can be simplified by observing that $Z^{(i)}_j = Z^{(i')}_j$ whenever $i' \in \partial(i, j)$. Therefore, the corresponding nodes in the Markov random field can be collapsed and treated as a single unit in the generalized belief propagation framework1 (GBP).

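Thus, after defining

$$
\begin{aligned}
\phi^{(i)}_j(l) &= f^{(i)}_j(H^{(i)}_j \mid l) \cdot \prod_{i' \in \partial(i,j)} f^{(i')}_j(H^{(i')}_j \mid l), \\
\psi^{(i)}_j(k \mid l) &= \left[ Q^{(i)}_j(k \mid l) \right]^{\eta_{i,j}} \cdot \prod_{i' \in \partial(i,j)} \left[ Q^{(i')}_j(k \mid l) \right]^{\eta_{i,j}},
\end{aligned}
\tag{2}
$$

it is not difficult to verify that the GBP messages are given by:

$$
\begin{aligned}
\mu_{(i,j)\to(i,j+1)}(k) &= \sum_{l=1}^{K} \psi^{(i)}_{j+1}(k \mid l) \cdot \phi^{(i)}_j(l) \cdot \mu_{(i,j-1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l), \\
\mu_{(i,j)\to(i,j-1)}(k) &= \sum_{l=1}^{K} \psi^{(i)}_{j}(l \mid k) \cdot \phi^{(i)}_j(l) \cdot \mu_{(i,j+1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l), \\
\mu_{(i,j)\to(i',j+1)}(k) &= \mu_{(i,j)\to(i,j+1)}(k), \quad \forall i' \in \partial(i,j+1), \\
\mu_{(i,j)\to(i',j-1)}(k) &= \mu_{(i,j)\to(i,j-1)}(k), \quad \forall i' \in \partial(i,j-1).
\end{aligned}
\tag{3}
$$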

The GBP rules written above can be simplified even further analytically. Assuming for simplicity that $\alpha^{(i)}_k = 1/K$ (as is the case in our applications), we can write the transition matrices $Q$ as:

$$
Q^{(i)}_j(k \mid l) = Q_j(k \mid l) = a_j + b_j \mathbb{1}[k = l], \qquad a_j = \frac{1}{K}\left(1 - e^{-\rho d_j}\right), \qquad b_j = e^{-\rho d_j}.
$$


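Therefore,

$$
\psi^{(i)}_j(k \mid l) = \left[ Q^{(i)}_j(k \mid l) \right]^{\eta_{i,j}} \cdot \prod_{i' \in \partial(i,j)} \left[ Q^{(i')}_j(k \mid l) \right]^{\eta_{i,j}} = \left[ Q_j(k \mid l) \right]^{\eta_{i,j}(1 + |\partial(i,j)|)} = Q_j(k \mid l).
$$

This simplification allows us to equivalently rewrite the forward update rule in (3) as:

$$
\begin{aligned}
\mu_{(i,j)\to(i,j+1)}(k) &= \sum_{l=1}^{K} \left[ a_{j+1} + b_{j+1} \mathbb{1}[k = l] \right] \cdot \phi^{(i)}_j(l) \cdot \mu_{(i,j-1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l) \\
&= a_{j+1} \sum_{l=1}^{K} \phi^{(i)}_j(l) \cdot \mu_{(i,j-1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l) \\
&\quad + b_{j+1}\, \phi^{(i)}_j(k) \cdot \mu_{(i,j-1)\to(i,j)}(k) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(k) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(k),
\end{aligned}
\tag{4}
$$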

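which has the advantage of having complexity $O(K)$ to evaluate, rather than $O(K^2)$. Similarly, we can rewrite the backward update rule as:

$$
\begin{aligned}
\mu_{(i,j)\to(i,j-1)}(k) &= \sum_{l=1}^{K} \left[ a_{j} + b_{j} \mathbb{1}[k = l] \right] \cdot \phi^{(i)}_j(l) \cdot \mu_{(i,j+1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l) \\
&= a_{j} \sum_{l=1}^{K} \phi^{(i)}_j(l) \cdot \mu_{(i,j+1)\to(i,j)}(l) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(l) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(l) \\
&\quad + b_{j} \cdot \phi^{(i)}_j(k) \cdot \mu_{(i,j+1)\to(i,j)}(k) \\
&\qquad \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(k) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(k),
\end{aligned}
\tag{5}
$$

which can also be evaluated at cost $O(K)$.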
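The saving from $O(K^2)$ to $O(K)$ comes from the rank-one-plus-diagonal form of $Q_j$: the sum over $l$ is shared by all $k$. The following Python sketch checks this numerically. Here `weights[l]` stands for the whole product $\phi^{(i)}_j(l) \cdot \mu(l) \cdots$ that multiplies the transition term in (4) or (5); the function names and the numerical values are illustrative assumptions.

```python
import math


def transition_coefficients(rho, d, K):
    """Return (a, b) such that Q(k | l) = a + b * 1[k == l]."""
    b = math.exp(-rho * d)
    return (1.0 - b) / K, b


def update_quadratic(a, b, weights):
    """Direct evaluation of sum_l [a + b 1[k = l]] * weights[l], for every k."""
    K = len(weights)
    return [sum((a + (b if k == l else 0.0)) * weights[l] for l in range(K))
            for k in range(K)]


def update_linear(a, b, weights):
    """Same messages, sharing the sum over l across all k."""
    total = sum(weights)
    return [a * total + b * w for w in weights]


a, b = transition_coefficients(rho=0.5, d=2.0, K=4)
weights = [0.1, 0.4, 0.2, 0.3]
slow = update_quadratic(a, b, weights)
fast = update_linear(a, b, weights)
```

Both functions return the same vector of $K$ message values up to floating-point error; only the second avoids the nested loop.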

The GBP formulation incorporates the IBD-sharing constraints implicitly, removing the vertical

messages and the corresponding small loops in the Markov random field. Even though some loops

may remain in the graphical model (e.g., if the same two haplotypes share two different IBD

segments), these will generally be large compared to the range of background LD, since we only

consider relatively long IBD segments. Therefore, we can expect the GBP approximation to work


well in general. Furthermore, in many practical cases, the resulting Markov random field is a tree, so the GBP solution will be very fast to compute and provide exact posterior probabilities.¹

GBP randomly initializes the messages $\mu_{(i,j)\to(i',j+1)}$ and $\mu_{(i,j)\to(i',j-1)}$, for all $i$, $j$ and $i' \in \partial(i,j)$, and then recursively updates them until convergence according to the rules laid out in (3). A schematic visualization of the updates is shown in Supplementary Figure 37. Even though convergence to an exact solution is only theoretically guaranteed if the underlying graph structure is a tree, the method often performs well in practice, especially if the graph is locally tree-like (i.e., it may have long loops but no short ones).²

Once the BP messages converge, the posterior distribution of $Z^{(i)}_j \mid H^{(1:m)}$ can be approximated with the product of its incoming messages:

$$
\mathbb{P}\left[ Z^{(i)}_j = k \mid H^{(1:m)} \right] \approx \mu_{(i,j-1)\to(i,j)}(k) \cdot \mu_{(i,j+1)\to(i,j)}(k) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j-1)} \mu_{(i',j-1)\to(i,j)}(k) \cdot \prod_{i' \in \partial(i,j) \setminus \partial(i,j+1)} \mu_{(i',j+1)\to(i,j)}(k).
\tag{6}
$$

Crucially, the above relation is exact in the case of trees, which includes the previously well-known example of a single haplotype sequence,³,⁴ as well as many non-trivial family structures (e.g., two haplotypes sharing one IBD segment).
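The exactness on trees can be checked numerically in the simplest case of a single haplotype sequence with no IBD sharing, where the messages reduce to the usual forward and backward recursions. The Python sketch below compares message-based marginals with brute-force enumeration over all state sequences. It assumes a uniform initial distribution (matching the convention $\mu_{(i,0)\to(i,1)}(k) = 1$), and it multiplies the incoming messages by the local emission term $f_j(H_j \mid k)$ before normalizing, which is the usual chain-HMM convention and an assumption about how (6) is normalized. All names and numbers are illustrative.

```python
import itertools


def chain_posteriors(emis, trans):
    """emis[j][k] = f_j(H_j | k); trans[j][l][k] = Q_j(k | l) for j >= 1."""
    p, K = len(emis), len(emis[0])
    fwd = [[1.0] * K for _ in range(p)]  # message entering node j from the left
    for j in range(1, p):
        fwd[j] = [sum(trans[j][l][k] * emis[j - 1][l] * fwd[j - 1][l]
                      for l in range(K)) for k in range(K)]
    bwd = [[1.0] * K for _ in range(p)]  # message entering node j from the right
    for j in range(p - 2, -1, -1):
        bwd[j] = [sum(trans[j + 1][k][l] * emis[j + 1][l] * bwd[j + 1][l]
                      for l in range(K)) for k in range(K)]
    post = []
    for j in range(p):
        w = [emis[j][k] * fwd[j][k] * bwd[j][k] for k in range(K)]
        total = sum(w)
        post.append([x / total for x in w])
    return post


def brute_force_posteriors(emis, trans):
    """Exact marginals by summing over all K^p state sequences."""
    p, K = len(emis), len(emis[0])
    post = [[0.0] * K for _ in range(p)]
    for z in itertools.product(range(K), repeat=p):
        w = emis[0][z[0]]  # uniform initial distribution, constant dropped
        for j in range(1, p):
            w *= trans[j][z[j - 1]][z[j]] * emis[j][z[j]]
        for j in range(p):
            post[j][z[j]] += w
    return [[x / sum(row) for x in row] for row in post]
```

On a chain the two functions agree up to floating-point error, whereas the brute-force cost grows as $K^p$.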

Since we are ultimately interested in sampling all coordinates of $Z^{(1:m)} \mid H^{(1:m)}$ jointly, our procedure does not end with (6). In general, after sampling $Z^{(i)}_j \mid H^{(1:m)}$ for some $i$ and $j$, one should update the Markov random field by conditioning on the observed value of $Z^{(i)}_j$ and update all messages until convergence before sampling the next variable, which is computationally infeasible. Fortunately, this procedure can be greatly simplified in our case because we only have relatively long IBD segments, and thus there are few loops in the graphical model. We leverage this fact by first sampling $Z^{(i)}_j$ for all $(i,j)$ in the set $\mathcal{J} \subseteq \{1, \ldots, m\} \times \{1, \ldots, p\}$ of junction nodes:

$$
\mathcal{J} = \{(i,j) \text{ s.t. } \partial(i,j) \neq \partial(i,j-1) \text{ or } \partial(i,j) \neq \partial(i,j+1)\}. \tag{7}
$$

Although this requires running $|\mathcal{J}|$ instances of GBP, this number will typically be small. Furthermore, warm starts can decrease the number of iterations required for convergence. Once $Z^{(i)}_j$ has been sampled for all $(i,j) \in \mathcal{J}$, the remaining random field is a collection of disjoint Markov chains, as visualized in Supplementary Figure 38. Therefore, posterior sampling can be carried out very efficiently with a simple forward-backward procedure that does not involve running BP at each step, as outlined in Supplementary Algorithm 2.
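As an illustration of definition (7), the Python sketch below builds the neighborhoods $\partial(i,j)$ from a list of pairwise IBD segments and then collects the junction nodes. The segment encoding `(i, i2, start, end)` with 0-based inclusive positions is a hypothetical input format, and treating $\partial(i,j)$ as empty outside the range of positions is an assumption about the boundary convention.

```python
def ibd_neighbors(m, p, segments):
    """nb[i][j] = set of haplotypes sharing an IBD segment with i at position j."""
    nb = [[set() for _ in range(p)] for _ in range(m)]
    for i, i2, start, end in segments:
        for j in range(start, end + 1):
            nb[i][j].add(i2)
            nb[i2][j].add(i)
    return nb


def junction_nodes(nb):
    """Nodes whose neighborhood differs from that of the previous or next position."""
    m, p = len(nb), len(nb[0])
    junctions = set()
    for i in range(m):
        for j in range(p):
            left = nb[i][j - 1] if j > 0 else set()       # assumed empty off the edge
            right = nb[i][j + 1] if j < p - 1 else set()  # assumed empty off the edge
            if nb[i][j] != left or nb[i][j] != right:
                junctions.add((i, j))
    return junctions


# Three haplotypes of length 10; haplotypes 0 and 1 share positions 3..6.
nb = ibd_neighbors(m=3, p=10, segments=[(0, 1, 3, 6)])
J = junction_nodes(nb)
```

With a single shared segment, only the positions on either side of its two endpoints are junctions, which is why $|\mathcal{J}|$ stays small when IBD segments are long and few.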


Supplementary Algorithm 2 Posterior sampling preserving familial constraints

Input: $H \in \{0,1\}^{m \times p}$, $K$, list of IBD segments $\{\partial(i,j)\}_{i \in \{1,\ldots,m\},\, j \in \{1,\ldots,p\}}$;
Input: a set $R(i)$ of $K$ references for each haplotype $H^{(i)}$.

Define the list of junction nodes $\mathcal{J} = \{(i,j) \text{ s.t. } \partial(i,j) \neq \partial(i,j-1) \text{ or } \partial(i,j) \neq \partial(i,j+1)\}$.
Initialize the list of active nodes $\mathcal{A} = \{1, \ldots, m\} \times \{1, \ldots, p\}$ and denote its complement as $\mathcal{A}^c$.
Initialize the forward messages $\mu_{(i,j)\to(i,j+1)}(k) = 1/K$, for all $i \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, p-1\}$.
Initialize the backward messages $\mu_{(i,j)\to(i,j-1)}(k) = 1/K$, for all $i \in \{1, \ldots, m\}$ and $j \in \{2, \ldots, p\}$.
for $(i^*, j^*) \in \mathcal{J} \cap \mathcal{A}$ do
    while messages not converged do
        for $j = 1, \ldots, p-1$ do
            for $i = 1, \ldots, m$ do
                if $(i,j) \in \mathcal{A}$ then
                    Update $\mu_{(i,j)\to(i,j+1)}(k)$, for all $k \in \{1, \ldots, K\}$, according to (4).
        for $j = p, \ldots, 2$ do
            for $i = 1, \ldots, m$ do
                if $(i,j) \in \mathcal{A}$ then
                    Update $\mu_{(i,j)\to(i,j-1)}(k)$, for all $k \in \{1, \ldots, K\}$, according to (5).
    Approximate the posteriors $w^{(i^*)}_{j^*}(k)$ of $Z^{(i^*)}_{j^*} = k \mid H^{(1:m)}, \{Z^{(i)}_j\}_{(i,j) \in \mathcal{A}^c}$ based on (6).
    Sample $Z^{(i^*)}_{j^*}$ from $\mathbb{P}[Z^{(i^*)}_{j^*} = k] = w^{(i^*)}_{j^*}(k)$.
    Update the list of active nodes: $\mathcal{A} \leftarrow \mathcal{A} \setminus \{(i^*, j^*)\}$.
    Update the Markov random field: $\phi^{(i^*)}_{j^*}(k) \leftarrow \mathbb{1}[k = Z^{(i^*)}_{j^*}]$, for each $k \in \{1, \ldots, K\}$.
    for $i' \in \partial(i^*, j^*)$ do
        Set $Z^{(i')}_{j^*} \leftarrow Z^{(i^*)}_{j^*}$.
        Update the list of active nodes: $\mathcal{A} \leftarrow \mathcal{A} \setminus \{(i', j^*)\}$.
        Update the Markov random field: $\phi^{(i')}_{j^*}(k) \leftarrow \mathbb{1}[k = Z^{(i')}_{j^*}]$, for each $k \in \{1, \ldots, K\}$.
Sample each disjoint segment of $\{Z^{(i)}_j\}_{(i,j) \in \mathcal{J}^c} \mid H^{(1:m)}, \{Z^{(i)}_j\}_{(i,j) \in \mathcal{J}}$, with standard forward-backward.⁴

Output: a latent Markov random field $Z \in \{1, \ldots, K\}^{m \times p}$ that preserves the IBD structure.

3. Knockoff generation via conditioning

Having sampled $Z^{(1:m)} \mid H^{(1:m)}$ with the procedure described above, we proceed to develop a method for generating knockoff copies $\tilde{Z}^{(1:m)}$. Even though constructing exact knockoffs for a general Markov random field may be computationally infeasible, we can simplify the problem by conditioning on some variables.⁵ In particular, we condition on all variables at the junctions of some IBD segment, i.e., those in the set $\mathcal{J}$ defined in (7). This operation transforms the graphical model


for the remaining variables into a collection of disjoint one-dimensional chains, for which knockoffs can be generated independently with the existing methods;³,⁴ see Supplementary Figure 38. This solution is summarised in Supplementary Algorithm 3.

Supplementary Algorithm 3 Related knockoff haplotypes via conditioning

Input: $H \in \{0,1\}^{m \times p}$, $d \in \mathbb{R}^{p-1}$, $\mathcal{G}$, and $K$ as in Algorithm 2;
Input: IBD segments $\{\partial(i,j)\}_{i \in \{1,\ldots,m\},\, j \in \{1,\ldots,p\}}$;
Input: a set $R(i)$ of $K$ references for each haplotype $H^{(i)}$;
Input: Markov random field states $Z \in \{1, \ldots, K\}^{m \times p}$.

Define the list of junction nodes $\mathcal{J} = \{(i,j) \text{ s.t. } \partial(i,j) \neq \partial(i,j-1) \text{ or } \partial(i,j) \neq \partial(i,j+1)\}$.
for $(i,j) \in \mathcal{J}$ do
    Define $G$ as the group in partition $\mathcal{G}$ to which variant $j$ belongs.
    for $j' \in G$ do
        Expand the list of junction nodes: $\mathcal{J} \leftarrow \mathcal{J} \cup \{(i,j')\}$.
for $(i,j) \in \mathcal{J}$ do
    Make a trivial knockoff: $\tilde{Z}^{(i)}_j \leftarrow Z^{(i)}_j$.
for each connected component $C$ in $\{1, \ldots, m\} \times \{1, \ldots, p\} \setminus \mathcal{J}$ do
    Generate group knockoffs $\{\tilde{Z}^{(i)}_j\}_{(i,j) \in C}$ of $\{Z^{(i)}_j\}_{(i,j) \in C} \mid \{Z^{(i)}_j\}_{(i,j) \in \mathcal{J}}$ as in previous work.⁴

Output: knockoff matrix $\tilde{Z} \in \{1, \ldots, K\}^{m \times p}$.


SUPPLEMENTARY NOTES

A. Review of HMMs for genotype data

In the context of ancestry inference,⁶ the $K$ states of the Markov chain in the HMM symbolize the possible ancestry groups, and transitions represent admixture.⁷ A typical parametrization is:

$$
\mathbb{P}[Z^{(i)}_1 = k] = \alpha^{(i)}_k, \qquad
\mathbb{P}[Z^{(i)}_j = k' \mid Z^{(i)}_{j-1} = k] =
\begin{cases}
e^{-r_j} + (1 - e^{-r_j})\, \alpha^{(i)}_{k'}, & \text{if } k' = k, \\
(1 - e^{-r_j})\, \alpha^{(i)}_{k'}, & \text{if } k' \neq k.
\end{cases}
\tag{8}
$$

Above, $r_j > 0$ controls the rate of admixture and is related to the physical distance between consecutive loci, while the positive weights $(\alpha^{(i)}_1, \ldots, \alpha^{(i)}_K)$ define the ancestry of the $i$-th individual and are normalized so that their sum equals one. Finally, the Bernoulli emission distributions $f_j(\cdot \mid k)$ summarize the allele frequencies in each ancestral group.

By simply assigning different values and interpretations to the parameters of this model, one can also obtain a good description of LD in homogeneous populations. For instance, the fastPHASE⁸ HMM takes the following form:

$$
\mathbb{P}[Z^{(i)}_1 = k] = \alpha_{1,k}, \qquad
\mathbb{P}[Z^{(i)}_j = k' \mid Z^{(i)}_{j-1} = k] =
\begin{cases}
e^{-r_j} + (1 - e^{-r_j})\, \alpha_{j,k'}, & \text{if } k' = k, \\
(1 - e^{-r_j})\, \alpha_{j,k'}, & \text{if } k' \neq k.
\end{cases}
\tag{9}
$$

Here, the discrete latent states are understood to represent clusters of similar haplotypes rather than precise ancestral groups, and the values of the $r_j$ parameters are typically larger, so that transitions in the Markov chain occur more frequently, consistent with the shorter range of LD. Note that, since the $\alpha$ parameters above do not depend on $i$, this model implicitly assumes all individuals to be equally related.
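To make the parametrization in (8) concrete, the Python sketch below builds the transition matrix for one step and samples a latent trajectory. It is only an illustration: the function names, the weights `alpha`, and the rates are made-up values, and emissions are omitted.

```python
import math
import random


def transition_matrix(r, alpha):
    """Row k holds P[Z_j = k' | Z_{j-1} = k] under the parametrization in (8)."""
    stay = math.exp(-r)
    K = len(alpha)
    return [[(stay if k2 == k else 0.0) + (1.0 - stay) * alpha[k2]
             for k2 in range(K)] for k in range(K)]


def sample_chain(alpha, rates, rng):
    """Sample Z_1 from alpha, then one transition per entry of rates."""
    states = list(range(len(alpha)))
    z = [rng.choices(states, weights=alpha)[0]]
    for r in rates:
        row = transition_matrix(r, alpha)[z[-1]]
        z.append(rng.choices(states, weights=row)[0])
    return z


alpha = [0.2, 0.5, 0.3]
Q = transition_matrix(0.3, alpha)
z = sample_chain(alpha, rates=[0.05] * 20, rng=random.Random(0))
```

Each row of `Q` sums to one because the weights `alpha` do, and smaller rates keep the chain in the same state for longer stretches, which is the sense in which larger $r_j$ correspond to a shorter range of LD.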

B. Limitations of the fastPHASE HMM

Previous work on knockoffs for GWAS data³,⁴ adopted an HMM in the form of (9), with parameters estimated using fastPHASE. While this model can provide a good description of the distribution of genotypes in a homogeneous sample, it cannot simultaneously account for population structure. This problem is illustrated below by a numerical experiment.

We simulate haplotypes from a toy HMM that mimics the presence of different ancestries and

possible admixture, as well as LD. In particular, we consider a Markov chain of length p = 500, with

8 possible states divided into 2 ancestral classes of equal size. The Markov chain parameters are


defined such that transitions occur more frequently within rather than across classes (Supplementary Figure 1). Conditional on the Markov chain, the emission distributions at different positions are independent Bernoulli with randomly fixed probabilities of success. We refer to the 8 sequences of success probabilities corresponding to each state of the Markov chain as the motifs of this HMM. These motifs have been randomly generated so that those within the same ancestral class are more similar to each other than to those in the other class (Supplementary Figure 2). With this model, we generate $n = 10{,}000$ independent haplotype sequences and then perform a PCA to quantify the population structure present in this synthetic data set (Supplementary Figure 3). By contrasting the results of this analysis to those obtained with data generated from an HMM in the form of (9), it becomes clear that our toy model induces significant population structure.

Supplementary Figure 3 suggests that knockoffs based on an estimated HMM in the form of (9) cannot be valid in this context. We validate this conjecture by applying fastPHASE to the above synthetic data set, setting $K = 8$, and using the estimated HMM to generate knockoffs with the usual algorithm.⁴ Unsurprisingly, the PCA in Supplementary Figure 4 shows that the knockoffs based on the fastPHASE HMM do not correctly preserve the population structure. Note that our choice of $K = 8$ in the fastPHASE HMM corresponds to the number of latent states in the true model, which is known in this toy example. Increasing $K$ within the fastPHASE model would not qualitatively change the results, since it would not address the main limitation that transitions between any two different Markov states are equally likely.

C. Preview of our method

We give a preview of the proposed method in the toy example of Supplementary Note B. We anticipate it will preserve the population structure much better than the fastPHASE HMM, as shown in Supplementary Figure 5. In this example, knockoffs are constructed for groups of 50 variants each; if the resolution is even lower, as in Supplementary Figure 6, the fastPHASE knockoffs completely break population structure, while our new method remains approximately valid. In fact, lower-resolution knockoffs tend to be more loosely coupled to the data,⁴ so they are naturally more sensitive to model misspecification; see also Supplementary Figure 7.

We can further examine the robustness of knockoffs at different resolutions in this toy example by comparing their exchangeability with the original variables in terms of their second moments. In particular, we compare values of $\mathrm{cor}(X_j, X_k)$ with $\mathrm{cor}(\tilde{X}_j, \tilde{X}_k)$, for $j, k \in \{1, \ldots, p\}$, in Supplementary Figure 8. Here, we see that the knockoffs constructed with the fastPHASE HMM can preserve


short-range LD but not long-range dependencies due to population structure, at low resolution. By contrast, the new knockoffs based on the SHAPEIT HMM seem approximately valid at all resolutions. Additional results consistent with this observation are given in Supplementary Figure 9, where we compare values of $\mathrm{cor}(X_j, X_k)$ with $\mathrm{cor}(X_j, \tilde{X}_k)$, for $j, k$ belonging to different groups in $\{1, \ldots, p\}$. We emphasize that valid knockoffs should produce scatter plots along the 45-degree line in both Supplementary Figures 8–9. Therefore, we can conveniently summarise the goodness-of-fit in both plots by computing the average mean square deviations from the reference 45-degree line, relative to the values of $\mathrm{cor}(X_j, X_k)$. This information is displayed in Supplementary Figure 10 at different resolutions, confirming again that the knockoffs based on the SHAPEIT HMM are more robust to population structure, especially at lower resolutions.
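The goodness-of-fit summary described above can be sketched in a few lines of Python. The sketch computes the sum of squared deviations from the 45-degree line divided by the sum of squared original correlations, either for a full swap (knockoff versus knockoff) or a partial swap (original versus knockoff). The column-wise data layout, the function names, and the choice of index pairs are illustrative assumptions; in practice the pairs would be restricted by distance and group membership.

```python
import math


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def relative_correlation_loss(X, X_knock, pairs, partial=False):
    """X and X_knock are lists of columns; pairs lists the (j, k) to compare."""
    num = den = 0.0
    for j, k in pairs:
        original = correlation(X[j], X[k])
        if partial:
            swapped = correlation(X[j], X_knock[k])        # partial swap
        else:
            swapped = correlation(X_knock[j], X_knock[k])  # full swap
        num += (original - swapped) ** 2
        den += original ** 2
    return num / den
```

A value near zero indicates that the scatter plot for the chosen pairs lies close to the 45-degree line.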

Having verified that we can generate approximately valid knockoffs preserving population structure, we need to confirm that these are not trivially identical to the original variables. Therefore, we describe in Supplementary Figures 11–12 the pairwise absolute correlations between each variable and its corresponding knockoff. Roughly speaking, higher correlations will result in lower power.⁹ These results show the new knockoffs should be almost as powerful as those obtained with the fastPHASE HMM at low resolution, although they may tend to be less powerful at higher resolutions. Even though this may be a reasonable price to pay for valid inference, we shall see later that the loss in power on real data is lower than this toy example may suggest.

Finally, we give a first direct preview of the impact of population structure on the validity of knockoff-based inference by generating a random response variable from a simple homoscedastic linear model with 50 nonzero coefficients, and then computing knockoff test statistics, separately with both types of knockoffs. More precisely, we define the following marginal importance measures:

$$
T_j = -\log(p_j), \qquad \tilde{T}_j = -\log(\tilde{p}_j),
$$

where $p_j$ and $\tilde{p}_j$ indicate the p-values for the marginal null hypotheses of no association between $Y$ and $X_j$ or $\tilde{X}_j$, respectively. Then, we define the knockoff test statistics for each group $G_g$ of variables as usual: $W_g = \sum_{j \in G_g} T_j - \sum_{j \in G_g} \tilde{T}_j$. Since marginal statistics are known to be more susceptible to the confounding effect of population structure compared to multivariate methods,¹⁰ this choice is designed to emphasize the limitations of the fastPHASE HMM in this example. The histogram of the test statistics for the null groups is shown in Supplementary Figure 13. We know the distribution should be symmetric around zero if the knockoffs are valid;⁹ this appears to be the case for the new knockoffs, but not for those based on the fastPHASE HMM, which are clearly skewed to the right and therefore will tend to induce an excess of false positives.
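The statistics $W_g$ and the symmetry check can be sketched as follows in Python. The first function follows the definition of $W_g$ given above; the second is a two-sided exact binomial test on the signs of the null statistics, where the two-sided form is an assumption about how the test is set up. Function names and inputs are illustrative.

```python
import math


def knockoff_statistics(pvals, pvals_knockoff, groups):
    """W_g = sum_j T_j - sum_j T~_j with T = -log(p), one value per group."""
    return [sum(-math.log(pvals[j]) for j in g)
            - sum(-math.log(pvals_knockoff[j]) for j in g)
            for g in groups]


def sign_test_pvalue(n_pos, n_neg):
    """Two-sided exact binomial test that positive and negative signs are equally likely."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    tail = sum(math.comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


W = knockoff_statistics([0.01, 0.5, 0.2, 0.3], [0.1, 0.5, 0.3, 0.2],
                        groups=[[0, 1], [2, 3]])
p_skewed = sign_test_pvalue(243, 157)
p_balanced = sign_test_pvalue(202, 198)
```

The last two lines apply the sign test to the counts of positive and negative null statistics reported in Supplementary Figure 13; a small p-value signals an asymmetric null distribution.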


D. Enrichment analysis with external summary statistics

We perform an enrichment analysis using external summary statistics from the Japan Biobank project¹¹ for the continuous traits, and from the FinnGen resource¹² for all binary traits except respiratory disease. These summary statistics were computed on data independent of those in the UK Biobank, but some care must be exercised in the interpretation of these results because: (a) the external statistics measure marginal association, not conditional importance; (b) the external sample sizes are smaller than ours, which limits power. Despite these difficulties, which we address as explained below, we find the enrichment analysis to be informative.

For each group of SNPs $G_g$ in the genome partition at the 20 kb resolution, we compute a chi-square statistic with Fisher's method: $\chi^2_g = -\sum_{j \in G_g} \log(p_j)$, where $\{p_j\}_{j \in G_g}$ denotes the set of external marginal p-values within the region spanned by $G_g$. Since the UK Biobank and the FinnGen project are based on different genome builds, our discoveries are matched to the external p-values after appropriately lifting the physical positions. We then define: $\{\chi^2_g\}_{S_{\text{novel}}}$ as the collection of external Fisher statistics corresponding to our novel discoveries in $S_{\text{novel}}$; $\{\chi^2_g\}_{S_{\text{confirmed}}}$ as the collection of external Fisher statistics corresponding to our previously confirmed discoveries (either confirmed by BOLT-LMM, or by the other studies based on the GWAS catalog, and the Japan Biobank or the FinnGen resource at the genome-wide significance level); and $\{\chi^2_g\}_{\text{background}}$ as the set of Fisher statistics for groups that are neither in $S_{\text{confirmed}}$ nor in $S_{\text{novel}}$.

We take the empirical distribution of $\{\chi^2_g\}_{\text{background}}$ as the null distribution, and invert it to compute an approximate enrichment p-value for each group in $S_{\text{novel}}$; we refer to these as $p^{\text{enrich}}_g$. The null hypothesis, under which the $p^{\text{enrich}}_g$ would be approximately uniformly distributed, is that the Fisher statistics for the novel discoveries have the same distribution as those in $\{\chi^2_g\}_{\text{background}}$; for instance, we expect this would be the case if all SNPs in our selected groups were independent of the trait of interest. In theory, we could use these p-values in combination with any multiple testing procedure; however, this turns out to have little power, due to the relatively small sample size (compared to the UK Biobank) of the external data. Nonetheless, it is clear that the empirical distribution of $\{p^{\text{enrich}}_g\}_{S_{\text{novel}}}$ is not uniform, which suggests many discoveries are non-null. Therefore, we take an empirical Bayes approach to estimate the proportion of non-null discoveries,¹³ as implemented by the “quantile” method in the fdrtool R package.¹⁴ This computes an estimate of the proportion of null enrichment p-values, which we bootstrap 10,000 times to assess its uncertainty. Table III and Supplementary Tables 16–17 are based on the corresponding mean bootstrap values, while Supplementary Tables 18–19 report the corresponding 90% bootstrap confidence intervals.
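The first two steps of this analysis (the per-group statistic and the empirical enrichment p-value) can be sketched in Python as follows. The statistic mirrors the formula written above; the add-one correction in the empirical p-value is an assumption made here to avoid p-values of exactly zero, and the function names and toy numbers are hypothetical. The empirical Bayes step performed with fdrtool is not reproduced.

```python
import math


def fisher_statistic(pvals):
    """Group statistic -sum_j log(p_j), as written in the text above."""
    return -sum(math.log(p) for p in pvals)


def enrichment_pvalue(stat, background):
    """Empirical p-value of stat against the background statistics (add-one corrected)."""
    exceed = sum(1 for b in background if b >= stat)
    return (1 + exceed) / (1 + len(background))


background = [1.0, 2.0, 3.0, 4.0]
novel_stat = fisher_statistic([math.exp(-1.5), math.exp(-2.0)])
p_enrich = enrichment_pvalue(novel_stat, background)
```

Because the null distribution is empirical, rescaling the statistic by a constant factor would leave these p-values unchanged.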


I. SUPPLEMENTARY FIGURES

[Figure: columns Sample 1, Sample 2, Sample 3; rows "Pop. structure" and "Homogeneous"; x-axis: Variable (0 to 500); y-axis: Markov chain state (1 to 8); color legend: Ancestry (1, 2).]

SUPPLEMENTARY FIGURE 1. Toy Markov chains with and without population structure. Visualization of three independent trajectories of the latent Markov chain in a toy HMM simulating haplotypes with population structure (top). The colors indicate the two possible ancestral classes, within which transitions among the 8 possible states are more likely. This is contrasted to a toy HMM without any population structure, where all state transitions are equally likely (bottom).

[Figure: one row per Markov state, labeled 1 to 8.]

SUPPLEMENTARY FIGURE 2. Toy HMM motifs with population structure. Visualization of the emission distributions for each of the 8 possible Markov states in the toy HMM of Supplementary Figure 1. Darker shades of grey indicate higher probabilities of $H_j = 1$, for each of the 500 variable indices $j$. The clustering dendrogram on the right-hand side shows that motifs in the same ancestral class are more similar to each other.


[Figure: (a) Pop. structure, x-axis: First PC (6.13%), y-axis: Second PC (2.77%); (b) Homogeneous, x-axis: First PC (0.39%), y-axis: Second PC (0.38%).]

SUPPLEMENTARY FIGURE 3. PCA for a toy HMM with population structure. PCA of 10,000 haplotypes generated from the toy HMM of Supplementary Figure 1, with (a) and without (b) population structure. The percentages indicate the proportion of genetic variance explained by the top principal components. For simplicity, only 1000 randomly chosen points are shown explicitly. The color of each point reflects the proportion of red and blue motifs (Supplementary Figure 2) in the Markov chain for the corresponding sample (Supplementary Figure 1); purple points indicate admixture of red and blue motifs.

[Figure: (a) Real genotypes, x-axis: First PC (6.13%), y-axis: Second PC (2.77%); (b) Knockoffs (fastPHASE HMM), x-axis: First PC (2.64%), y-axis: Second PC (1.52%).]

SUPPLEMENTARY FIGURE 4. PCA with population structure and fastPHASE knockoffs. Principal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary Figure 1, and of the corresponding knockoffs based on the fastPHASE model (b). Knockoffs for groups of size 50. Other details are as in Supplementary Figure 3. Note that the colors in (b) are based on the motifs in the real genotypes, as in (a), not those in the knockoffs.


[Figure: (a) Real data, First PC (6.13%), Second PC (2.77%); (b) SHAPEIT, First PC (5.40%), Second PC (2.30%); (c) fastPHASE, First PC (2.64%), Second PC (1.52%).]

SUPPLEMENTARY FIGURE 5. PCA for a toy HMM with population structure and knockoffs. Principal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary Figure 1, of the corresponding knockoffs based on the SHAPEIT model (b), and of the knockoffs based on the fastPHASE model (c). Knockoffs for groups of size 50. Other details are as in Supplementary Figure 4.

[Figure: (a) Real data, First PC (6.13%), Second PC (2.77%); (b) SHAPEIT, First PC (5.96%), Second PC (2.73%); (c) fastPHASE, First PC (1.41%), Second PC (1.34%).]

SUPPLEMENTARY FIGURE 6. PCA for a toy HMM with population structure and knockoffs. Principal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary Figure 1, of the corresponding knockoffs based on our new method (b), and of the knockoffs based on the fastPHASE model (c). Knockoffs for groups of size 200. Other details are as in Supplementary Figure 5.


[Figure: panels "First PC" and "Second PC"; x-axis: Group size (1, 2, 5, 20, 100, 500); y-axis: Variance explained (%); legend: Model (SHAPEIT, fastPHASE).]

SUPPLEMENTARY FIGURE 7. PCA of knockoffs robustness to population structure. Proportion of variance explained by the first (left) and second (right) principal component of knockoffs in the toy example of Supplementary Figure 5, as a function of the knockoff resolution. The black horizontal lines indicate the proportion of variance explained by principal components on the original data, which should be approximately matched (up to finite-sample variations) by valid knockoffs.

[Figure: columns Group size 1, 20, 100; rows SHAPEIT, fastPHASE; axes: $\mathrm{corr}(X_j, X_k)$ and $\mathrm{corr}(\tilde{X}_j, \tilde{X}_k)$; legend: Range, Short ($d = 1$), Medium ($20 < d < 25$), Long ($100 < d < 110$).]

SUPPLEMENTARY FIGURE 8. Second-moment knockoff exchangeability diagnostics. Exchangeability diagnostics for different knockoffs, in the example of Supplementary Figure 7. We compare $\mathrm{cor}(X_j, X_k)$ with $\mathrm{cor}(\tilde{X}_j, \tilde{X}_k)$, for $j, k \in \{1, \ldots, p\}$, as a function of the distance between $j$ and $k$. These diagnostics should approximately lie on the 45-degree line if the knockoffs are valid.


[Figure: columns Group size 1, 20, 100; rows SHAPEIT, fastPHASE; axes: $\mathrm{corr}(X_j, X_k)$ and $\mathrm{corr}(X_j, \tilde{X}_k)$; legend: Range, Short ($d = 1$), Medium ($20 < d < 25$), Long ($100 < d < 110$).]

SUPPLEMENTARY FIGURE 9. Additional exchangeability diagnostics in toy example. We compare $\mathrm{cor}(X_j, X_k)$ to $\mathrm{cor}(X_j, \tilde{X}_k)$, for $j, k$ in different groups. Other details are as in Supplementary Figure 8.


[Figure: columns Short range ($d = 1$), Medium range ($20 < d < 25$), Long range ($100 < d < 110$); rows Full swap, Partial swap; x-axis: Group size (1, 2, 5, 20, 100, 500); y-axis: Mean squared loss of correlation (%); legend: Model (SHAPEIT, fastPHASE).]

SUPPLEMENTARY FIGURE 10. Second-moment knockoff goodness-of-fit in toy example. Mean loss of correlation between variables upon: full swap with knockoffs (top); partial swap with knockoffs (bottom). This is defined as $\sum_{j,k} [\mathrm{cor}(X_j, X_k) - \mathrm{cor}(\tilde{X}_j, \tilde{X}_k)]^2 / \sum_{j,k} [\mathrm{cor}(X_j, X_k)]^2$ (top), or $\sum_{j,k} [\mathrm{cor}(X_j, X_k) - \mathrm{cor}(X_j, \tilde{X}_k)]^2 / \sum_{j,k} [\mathrm{cor}(X_j, X_k)]^2$ (bottom); the sums are taken over pairs of indices whose distances are within the specified range. Valid knockoffs should have values close to zero. Equivalently, this summarises the relative deviation of the scatter plots in Supplementary Figures 8–9 from the 45-degree line.


[Figure: columns Group size 1, 20, 100; rows SHAPEIT, fastPHASE; x-axis: $|\mathrm{corr}(X_j, \tilde{X}_j)|$; y-axis: Count.]

SUPPLEMENTARY FIGURE 11. Pairwise similarities between variables and knockoffs. Histograms of $|\mathrm{corr}(X_j, \tilde{X}_j)|$ in the example of Supplementary Figure 3, for different knockoff constructions and different levels of resolution.

[Figure: x-axis: Group size (1, 2, 5, 20, 100, 500); y-axis: $|\mathrm{corr}(X_j, \tilde{X}_j)|$; legend: Model (SHAPEIT, fastPHASE).]

SUPPLEMENTARY FIGURE 12. Similarity of variables and knockoffs at different resolutions. Average $|\mathrm{corr}(X_j, \tilde{X}_j)|$ in the toy example of Supplementary Figure 3, as a function of the resolution. Equivalently, this summarises the mean quantities in Supplementary Figure 11 for knockoffs at different levels of resolution.


[Figure: panels "fastPHASE (+243, -157, p-value: 2.0e-05)" and "new (+202, -198, p-value: 8.8e-01)"; x-axis: Null test statistics; y-axis: Count.]

SUPPLEMENTARY FIGURE 13. Distribution of null knockoff test statistics in toy example. Histogram of knockoff test statistics (based on marginal p-values, as defined in Supplementary Note C) for null groups (of size 50), for a simulated phenotype in the toy example of Supplementary Figure 3. The counts at the top indicate the numbers of statistics with positive or negative signs; the p-value is obtained from an exact binomial test of the null hypothesis that positive and negative statistics are equally likely.


[Figure, panel (a): scatter plot of the second principal component against the first principal component, colored by self-reported ethnicity (British, Irish, Indian, Chinese, Caribbean, African). Panel (b): the same axes computed on knockoffs. Panel columns: fastPHASE, SHAPEIT. Panel rows: single-SNP, 425 kb.]

SUPPLEMENTARY FIGURE 14. PCA of individuals with diverse ancestries, and of knockoffs. The first two genetic principal components of 10,000 individuals in the UK Biobank with one of 6 possible self-reported ancestries (a) are compared to the corresponding quantities computed on knockoffs at different resolutions (b). The knockoffs based on the SHAPEIT HMM preserve population structure quite accurately, even at low resolution. By contrast, the fastPHASE HMM tends to produce knockoffs that shrink together individuals with diverse ancestries, thus breaking population structure.

[Figure: variance explained (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Knockoffs: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 15. Variance explained by principal components of knockoffs. Proportion of genetic variance explained by the first ten principal components of knockoffs at different resolutions, for samples with diverse ancestries. The dashed horizontal line indicates the corresponding quantity computed on the original data. Other details are as in Supplementary Figure 14.

[Figure: scatter plots comparing $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(\tilde{X}_j, \tilde{X}_k)|$. Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0], (5.0, 10.0] Mb. Panel rows: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 16. High-resolution knockoff exchangeability, diverse ancestries. Exchangeability diagnostics of knockoffs based on different HMMs, for 10,000 individuals with diverse ancestries, as in Supplementary Figure 14. We compare $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(\tilde{X}_j, \tilde{X}_k)|$, for $j, k \in \{1, \ldots, p\}$, as a function of the distance between variants $j$ and $k$ on chromosome 22. Only 1000 randomly chosen points are shown, for clarity. Variants with minor allele frequency smaller than 0.01 are not shown here, due to the limited sample size. These diagnostics should approximately lie on the 45-degree line if the knockoffs are valid [4]. Knockoff resolution: single-SNP groups.

[Figure: scatter plots comparing $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(\tilde{X}_j, \tilde{X}_k)|$. Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0], (5.0, 10.0] Mb. Panel rows: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 17. Low-resolution knockoff exchangeability, diverse ancestries. Exchangeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. Knockoff resolution: 425 kb. Other details are as in Supplementary Figure 16.

[Figure: scatter plots comparing $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(X_j, \tilde{X}_k)|$. Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0], (5.0, 10.0] Mb. Panel rows: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 18. Additional diagnostics of high-resolution exchangeability. Exchangeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. We compare $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(X_j, \tilde{X}_k)|$, for $j, k$ in different groups, as a function of the distance between variants $j$ and $k$. Other details are as in Supplementary Figure 16.

[Figure: scatter plots comparing $|\mathrm{cor}(X_j, X_k)|$ with $|\mathrm{cor}(X_j, \tilde{X}_k)|$. Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0], (5.0, 10.0] Mb. Panel rows: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 19. Additional diagnostics of low-resolution exchangeability. Exchangeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. Knockoff resolution: 425 kb. Other details are as in Supplementary Figure 18.

[Figure: mean squared loss of correlation (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0], (5.0, 10.0] Mb. Panel rows: full swap, partial swap. Models: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 20. Second-moment knockoff goodness-of-fit, diverse ancestries. Mean squared loss of correlation between variables upon: full swap with knockoffs (top); partial swap with knockoffs (bottom). This is defined as $[\mathrm{cor}(X_j, X_k) - \mathrm{cor}(\tilde{X}_j, \tilde{X}_k)]^2$ (top), or $[\mathrm{cor}(X_j, X_k) - \mathrm{cor}(X_j, \tilde{X}_k)]^2$ (bottom), each averaged over pairs of variables $j, k$ whose physical distances are within the specified range, similarly to the diagnostics in Supplementary Figure 10. In words, these diagnostics summarise the average distances from the 45-degree line in the scatter plots of Supplementary Figures 16–19, and also cover intermediate levels of resolution. Valid knockoffs should have values close to zero.
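The two goodness-of-fit diagnostics defined in this caption can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the helper names `cor` and `correlation_loss`, the toy data, and the positions are all hypothetical. The full-swap loss compares the correlation of two variables with that of their two knockoffs, and the partial-swap loss compares it with the correlation between one variable and the other's knockoff.

```python
import random
from itertools import combinations
from statistics import mean


def cor(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_loss(X, Xk, positions, lo, hi):
    """Mean squared loss of correlation over pairs with lo < distance <= hi.

    X and Xk hold one list of observations per variant (originals, knockoffs).
    Returns (full_swap_loss, partial_swap_loss).
    """
    full, partial = [], []
    for j, k in combinations(range(len(X)), 2):
        if lo < abs(positions[j] - positions[k]) <= hi:
            c = cor(X[j], X[k])
            full.append((c - cor(Xk[j], Xk[k])) ** 2)
            partial.append((c - cor(X[j], Xk[k])) ** 2)
    return mean(full), mean(partial)


# Toy data (hypothetical): 4 correlated variants observed on 200 samples.
rng = random.Random(0)
n, p = 200, 4
X = [[rng.gauss(0, 1) for _ in range(n)]]
for _ in range(p - 1):
    X.append([0.8 * prev + rng.gauss(0, 1) for prev in X[-1]])
positions = [0.0, 1.0, 2.0, 3.0]  # hypothetical positions in Mb

# Independent noise ignores the dependence between variants, so the loss is large.
noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
loss_noise = correlation_loss(X, noise, positions, 0.0, 1.0)

# An exact copy preserves all second moments, so both losses are zero.
loss_copy = correlation_loss(X, X, positions, 0.0, 1.0)
print(loss_noise, loss_copy)
```

The exact copy scores perfectly on these diagnostics but would have no power, which is why the similarity between variables and knockoffs is a separate quantity worth reporting alongside them.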

[Figure: pairwise similarity (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Models: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 21. Similarity between genotypes and knockoffs, diverse ancestries. Average absolute pairwise correlation between genotypes and knockoffs on chromosome 22 for 10,000 samples with diverse ancestries in the UK Biobank, as a function of the resolution. Other details are as in Supplementary Figure 14.

[Figure: histograms of pairwise similarity (horizontal axis) against count (vertical axis). Panel columns: single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb. Panel rows: SHAPEIT, fastPHASE.]

SUPPLEMENTARY FIGURE 22. Similarity between genotypes and knockoffs, diverse ancestries. Histograms of absolute pairwise correlations between genotypes and knockoffs at different resolutions. Other details are as in Supplementary Figure 21.

[Figure: histograms of test statistics (horizontal axis) against count (vertical axis). Panels: Null - fastPHASE, Null - SHAPEIT, Causal - fastPHASE, Causal - SHAPEIT.]

SUPPLEMENTARY FIGURE 23. Knockoff statistics in simulation with diverse ancestries. Histogram of lasso-based knockoff test statistics for null (left) and causal (right) groups of variants, for a simulated phenotype and real genotypes with population structure, as in Supplementary Figure 14. The knockoffs are constructed by different algorithms at resolution equal to 425 kb.

[Figure: FDR and power (vertical axes) against heritability (horizontal axis). Panel columns: single-SNP, 208 kb, 425 kb. Panel rows: FDR, Power. Methods: KnockoffZoom v2, KnockoffZoom v1.]

SUPPLEMENTARY FIGURE 24. Simulations with diverse ancestries, strong signals. KnockoffZoom performance at different levels of resolution in simulations with real genotypes and artificial phenotypes. The number of causal variants is equal to 100. Other details are as in Figure 1.

[Figure: FDR, power, and resolution in kb (vertical axes) against heritability (horizontal axis). Panel columns: 100, 200, 500. Panel rows: FDR, Power, Resolution (kb). Methods: KnockoffZoom v2, KnockoffZoom v1.]

SUPPLEMENTARY FIGURE 25. Performance in simulations with diverse ancestries. KnockoffZoom performance in simulations with real genotypes and artificial phenotypes. The discoveries at different resolutions are combined, counting only the most specific findings in each locus [4]. The results corresponding to phenotypes with different numbers of causal variants are shown in separate columns. Other details are as in Figure 1 and Supplementary Figure 24.

[Figure: histograms of test statistics (horizontal axis) against count (vertical axis). Panels: Null - fastPHASE, Null - SHAPEIT, Causal - fastPHASE, Causal - SHAPEIT.]

SUPPLEMENTARY FIGURE 26. Knockoff statistics in simulation with diverse ancestries. Histogram of LMM-based knockoff test statistics for null (left) and causal (right) groups of variants. Other details are as in Supplementary Figure 23.

[Figure: histogram of pairwise kinship (horizontal axis) against count (vertical axis).]

SUPPLEMENTARY FIGURE 27. Kinship in 10,000 related samples from the UK Biobank. Histogram of pairwise kinship values between the 10,000 related British samples in the UK Biobank used for our numerical experiments. Kinship is defined here as the fraction of shared DNA (e.g., 100% for identical twins, 50% for parents or full siblings, 25% for half siblings, 12.5% for first cousins).

[Figure: IBD segment width in Mb (vertical axis) against chromosome, 1 to 22 (horizontal axis).]

SUPPLEMENTARY FIGURE 28. IBD segments in 10,000 related samples from the UK Biobank. Violin plots displaying the distribution of IBD segment lengths in the 10,000 related British samples in the UK Biobank used for our numerical experiments.

[Figure: IBD segment width in Mb (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Relatedness: Preserved, Ignored.]

SUPPLEMENTARY FIGURE 29. Width of IBD segments in knockoffs for related samples. Average width of IBD segments on chromosome 22 in knockoffs for 10,000 related samples in UK Biobank, as in Supplementary Figure 27. The dashed horizontal line indicates the corresponding quantity computed on the data. The knockoffs are generated with our new method, either preserving or ignoring familial relatedness.

[Figure: mean squared loss of correlation (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Panel columns: distance ranges (0, 0.1], (0.1, 0.5], (0.5, 1.0], (1.0, 5.0] Mb. Panel rows: full swap, partial swap. Relatedness: Preserved, Ignored.]

SUPPLEMENTARY FIGURE 30. Second-moment goodness-of-fit of knockoffs for related samples. Goodness-of-fit of knockoffs at different resolutions, using data from 10,000 related British samples in the UK Biobank, defined as in Supplementary Figure 20. Valid knockoffs should have values close to zero. The knockoffs are generated with our new method, either preserving or ignoring familial relatedness.

[Figure: pairwise similarity (vertical axis) against resolution (single-SNP, 3 kb, 20 kb, 41 kb, 81 kb, 208 kb, 425 kb). Relatedness: Preserved, Ignored.]

SUPPLEMENTARY FIGURE 31. Similarity between genotypes and knockoffs for related samples. Average absolute pairwise correlation between genotypes and knockoffs on chromosome 22, as a function of the resolution. Other details are as in Supplementary Figure 30.

[Figure: FDR and power (vertical axes) against heritability (horizontal axis). Panel columns: resolution single-SNP, 41 kb, 425 kb. Panel rows: FDR, Power. Relatedness: Preserved, Ignored.]

SUPPLEMENTARY FIGURE 32. Performance in simulations with related samples. KnockoffZoom v2 performance at different levels of resolution on related samples, using test statistics based on sparse logistic regression, as in Figure 2. Strong environmental effects, γ = 1. Other details are as in Figure 2.

[Figure: FDR and power (vertical axes) against heritability (horizontal axis). Panel columns: γ = 0, γ = 0.655, γ = 1. Panel rows: FDR, Power. Relatedness: Preserved, Ignored.]

SUPPLEMENTARY FIGURE 33. Performance in simulations with marginal statistics. KnockoffZoom v2 performance with marginal statistics. Other details are as in Figure 2. Note that marginal statistics have almost no power, although an excess of false discoveries occurs if the relatedness is not preserved.

[Figure: histograms of test statistics (horizontal axis) against count (vertical axis, log scale). Panels: "γ = 0, Ignored", "γ = 0, Preserved", "γ = 1, Ignored", "γ = 1, Preserved".]

SUPPLEMENTARY FIGURE 34. Lasso knockoff statistics in simulation with related samples. Histogram of lasso-based test statistics for null groups of variants, for simulated phenotypes and real genotypes from 10,000 related samples in the UK Biobank, as in Supplementary Figure 29. (a): no environmental effects within families; (b): strong environmental effects within families. Resolution equal to 425 kb. Little difference can be observed when γ = 0, while γ = 1 induces a small but clear (notice the log scale on the vertical axis) rightward bias in the test statistics if the relatedness is ignored.

[Figure: histograms of test statistics (horizontal axis) against count (vertical axis). Panels: "γ = 0, Ignored", "γ = 0, Preserved", "γ = 1, Ignored", "γ = 1, Preserved".]

SUPPLEMENTARY FIGURE 35. Marginal knockoff statistics in simulation with related samples. Histogram of marginal knockoff test statistics for null groups of variants in a simulation with related samples. Other details are as in Supplementary Figure 34.

[Figure, panel (a): two parallel chains of latent states, $Z^a_1, \ldots, Z^a_8$ and $Z^b_1, \ldots, Z^b_8$. Panel (b): graph over the nodes $Z^a_1, Z_2, Z_3, Z_4, Z^a_5, Z^a_6, Z_7, Z_8$ and $Z^b_1, Z^b_5, Z^b_6$.]

SUPPLEMENTARY FIGURE 36. Model for the latent states in the HMM for haplotype families. Graphical representation of the distribution of latent states in the HMM for two haplotype sequences of length 8 sharing 2 IBD segments (shaded). (a): Representation as a Markov chain with $K^2$ possible states in each position, with the constraint that nodes connected by a vertical edge must be identical to each other. (b): Equivalent representation of this model as a Markov random field with 11 variables, each taking one of $K$ possible values.
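The count of 11 variables in representation (b) can be checked with a short sketch. The function name `count_latent_nodes` is hypothetical, and the IBD segment positions (2 to 4 and 7 to 8) are read off the node labels of the figure, where shared nodes carry no haplotype superscript; they are an assumption about the example rather than a statement from the text.

```python
def count_latent_nodes(length, ibd_segments, n_haplotypes=2):
    """Number of distinct latent variables when IBD positions are merged.

    ibd_segments: list of (start, end) positions, 1-based and inclusive,
    assumed to be shared by all n_haplotypes sequences and not to overlap.
    """
    shared = sum(end - start + 1 for start, end in ibd_segments)
    # Each shared position contributes one node; every other position
    # contributes one node per haplotype.
    return n_haplotypes * (length - shared) + shared


# Two haplotypes of length 8 with IBD segments at positions 2-4 and 7-8
# (positions inferred from the figure labels).
n_nodes = count_latent_nodes(8, [(2, 4), (7, 8)])
print(n_nodes)
```

Merging the shared positions replaces the equality constraints of representation (a) with a smaller graph in which every node takes one of $K$ values, at the cost of no longer being a simple chain.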

[Figure: the graph of Supplementary Figure 36 (b), with messages along the edges labeled $t-2$, $t-1$, and $t$.]

SUPPLEMENTARY FIGURE 37. Visualization of belief propagation for haplotype families. Belief propagation update of a message in the example of Supplementary Figure 36. The new message evaluated here at time $t$ is that from the third node of the first shared IBD segment to the successive node in the first haplotype sequence (bold arrow). This is computed as a function of the messages labeled as $t-1$, which had previously been computed as a function of those labeled as $t-2$. Red: forward messages; blue: backward messages.

[Figure, panels (a) and (b): the graph of Supplementary Figure 36 (b) over the nodes $Z^a_1, Z_2, Z_3, Z_4, Z^a_5, Z^a_6, Z_7, Z_8$ and $Z^b_1, Z^b_5, Z^b_6$, with the nodes $Z_2$, $Z_4$, and $Z_7$ marked; panel (b) also labels a loop.]

SUPPLEMENTARY FIGURE 38. Graphical model and conditioning for haplotype families. Graphical model for related haplotypes in the example of Supplementary Figure 36. (a): the nodes at the extremities of the IBD segments are shaded. (b): conditional on the extremities of the IBD segments, the remaining latent nodes are distributed as independent Markov chains.

II. SUPPLEMENTARY TABLES

Median width (kb)  Mean width (kb)  Number of groups  Median size (SNPs)  Mean size (SNPs)
single-SNP         single-SNP       591513            1                   1
3                  11               151532            3                   4
20                 41               56562             8                   10
41                 74               33929             14                  17
81                 134              19500             26                  30
208                303              8863              58                  67
425                575              4738              113                 125

SUPPLEMENTARY TABLE 1. Genome partitions at different resolutions. Summary of 7 partitions of the genome into disjoint groups of contiguous SNPs. The first column (median width in kb) will be used to reference particular resolutions throughout this chapter.

Ethnicity Count

African 1710

British 1710

Caribbean 1710

Chinese 1450

Indian 1710

Irish 1710

SUPPLEMENTARY TABLE 2. Ethnicities of 10,000 unrelated individuals in the UK Biobank.

Summary of the self-reported ethnicities for the individuals in the UK Biobank used in our simulations.

Family size Number of families Average kinship

1 1 N.A.

2 4702 0.273

3 193 0.270

4 4 0.265

SUPPLEMENTARY TABLE 3. Family structure and average kinship of 10,000 related samples.

Summary of the self-reported family structure for 10,000 British individuals used in our simulations. These

families are chosen as those with the largest average kinship, which is defined as in Supplementary Figure 27.

One extra individual is included without relatives to bring the total number to a round value.

Name            Description              Number of cases  UK Biobank Fields     UK Biobank Codes
bmi             body mass index          continuous       21001-0.0
cvd             cardiovascular disease   148715           20002-0.0–20002-0.32  1065, 1066, 1067, 1068, 1081, 1082, 1083, 1425, 1473, 1493
diabetes        diabetes                 19897            20002-0.0–20002-0.32  1220
height          standing height          continuous       50-0.0
hypothyroidism  hypothyroidism           22493            20002-0.0–20002-0.32  1226
platelet        platelet count           continuous       30080-0.0
respiratory     respiratory disease      64945            20002-0.0–20002-0.32  1111, 1112, 1113, 1114, 1115, 1117, 1413, 1414, 1415, 1594
sbp             systolic blood pressure  continuous       4080-0.0, 4080-0.1

SUPPLEMENTARY TABLE 4. Phenotype definitions. Definition of the UK Biobank phenotypes used in our analysis [4]. In the case of case-control phenotypes, the number of cases refers to the subset of individuals that passed our quality control.

Everyone British White (non-British)

Phenotype Resolution all unrel. all unrel. all unrel.

cvd single-SNP 0 0 0 0 0 0

3 kb 22 20 0 0 0 0

20 kb 239 152 169 140 0 0

41 kb 339 235 270 181 0 0

81 kb 566 428 611 462 0 0

208 kb 940 594 815 611 0 0

425 kb 1089 861 1004 711 0 0

diabetes single-SNP 0 0 0 0 0 0

3 kb 0 17 0 12 0 0

20 kb 83 44 63 53 0 0

41 kb 123 86 82 57 0 0

81 kb 193 152 165 129 0 0

208 kb 262 242 217 171 0 0

425 kb 383 346 289 291 0 0

hypothyroidism single-SNP 19 22 11 11 0 0

3 kb 40 42 60 32 0 0

20 kb 105 79 109 86 0 0

41 kb 222 156 164 130 0 0

81 kb 277 173 269 153 0 0

208 kb 295 257 288 256 0 0

425 kb 335 309 312 266 0 0

respiratory single-SNP 0 0 0 11 0 0

3 kb 21 0 37 0 0 0

20 kb 61 33 54 33 0 0

41 kb 109 62 73 66 0 0

81 kb 109 84 63 94 0 0

208 kb 113 79 119 84 0 0

425 kb 194 102 186 139 0 0

SUPPLEMENTARY TABLE 5. Discoveries for UK Biobank phenotypes (binary). Numbers of

discoveries at different resolutions, using different subsets of the UK Biobank samples. Binary phenotypes.

Everyone British White (non-British)

Phenotype Resolution all unrel. all unrel. all unrel.

bmi single-SNP 95 64 80 69 0 10

3 kb 570 483 609 377 0 0

20 kb 1503 1294 1610 1412 25 20

41 kb 2384 1966 2353 2141 74 80

81 kb 3006 2768 3002 2681 91 80

208 kb 3339 3111 3370 3117 112 101

425 kb 3073 2922 2938 2735 170 104

height single-SNP 0 0 0 12 0 0

3 kb 10 10 0 0 0 0

20 kb 343 230 207 180 0 0

41 kb 918 566 820 492 0 0

81 kb 1480 1194 1433 1280 0 0

208 kb 2395 1938 2381 1975 0 0

425 kb 2460 2109 2426 2092 0 10

platelet single-SNP 53 52 34 52 0 0

3 kb 246 259 223 202 0 0

20 kb 1002 820 977 777 26 31

41 kb 1261 995 1171 944 52 44

81 kb 1570 1350 1502 1292 69 55

208 kb 1743 1583 1809 1510 53 51

425 kb 1653 1550 1741 1521 76 60

sbp single-SNP 0 0 0 0 0 0

3 kb 83 90 42 32 0 0

20 kb 191 162 166 127 0 0

41 kb 511 353 421 342 0 0

81 kb 830 635 736 585 0 0

208 kb 1183 972 1050 911 0 0

425 kb 1543 1202 1401 1273 0 0

SUPPLEMENTARY TABLE 6. Discoveries for UK Biobank phenotypes (continuous). Numbers of discoveries for continuous phenotypes. Other details are as in Supplementary Table 5.

KnockoffZoom v2 BOLT-LMM

Phenotype Resolution Discoveries Overlap with LMM Discoveries Overlap with KZ

cvd 3 kb 22 22 (100.0%) 257 25 (9.7%)

20 kb 239 180 (75.3%) 257 189 (73.5%)

41 kb 339 212 (62.5%) 257 213 (82.9%)

81 kb 566 261 (46.1%) 257 241 (93.8%)

208 kb 940 274 (29.1%) 257 249 (96.9%)

425 kb 1089 255 (23.4%) 257 254 (98.8%)

diabetes 3 kb 21 20 (95.2%) 62 21 (33.9%)

20 kb 61 45 (73.8%) 62 47 (75.8%)

41 kb 109 54 (49.5%) 62 52 (83.9%)

81 kb 109 50 (45.9%) 62 54 (87.1%)

208 kb 113 52 (46.0%) 62 55 (88.7%)

425 kb 194 57 (29.4%) 62 59 (95.2%)

hypothyroidism single-SNP 19 19 (100.0%) 143 30 (21.0%)

3 kb 40 40 (100.0%) 143 53 (37.1%)

20 kb 105 89 (84.8%) 143 101 (70.6%)

41 kb 222 128 (57.7%) 143 123 (86.0%)

81 kb 277 133 (48.0%) 143 130 (90.9%)

208 kb 295 129 (43.7%) 143 142 (99.3%)

425 kb 335 122 (36.4%) 143 142 (99.3%)

respiratory 20 kb 83 60 (72.3%) 94 62 (66.0%)

41 kb 123 74 (60.2%) 94 75 (79.8%)

81 kb 193 83 (43.0%) 94 85 (90.4%)

208 kb 262 82 (31.3%) 94 92 (97.9%)

425 kb 383 82 (21.4%) 94 93 (98.9%)

SUPPLEMENTARY TABLE 7. Comparison with BOLT-LMM (binary). KnockoffZoom discoveries for binary phenotypes using all UK Biobank samples vs. BOLT-LMM genome-wide significant discoveries ($5 \times 10^{-8}$). BOLT-LMM is applied on 459k European samples for cardiovascular disease and hypothyroidism [15], and on 350k unrelated British samples for diabetes and respiratory disease [4].

KnockoffZoom v2 BOLT-LMM

Phenotype Resolution Discoveries Overlap with LMM Discoveries Overlap with KZ

bmi 3 kb 10 10 (100.0%) 697 15 (2.2%)

20 kb 343 309 (90.1%) 697 317 (45.5%)

41 kb 918 618 (67.3%) 697 548 (78.6%)

81 kb 1480 792 (53.5%) 697 641 (92.0%)

208 kb 2395 898 (37.5%) 697 689 (98.9%)

425 kb 2460 794 (32.3%) 697 695 (99.7%)

height single-SNP 95 95 (100.0%) 2464 225 (9.1%)

3 kb 570 570 (100.0%) 2464 891 (36.2%)

20 kb 1503 1469 (97.7%) 2464 1761 (71.5%)

41 kb 2384 2167 (90.9%) 2464 2167 (87.9%)

81 kb 3006 2417 (80.4%) 2464 2360 (95.8%)

208 kb 3339 2228 (66.7%) 2464 2430 (98.6%)

425 kb 3073 1804 (58.7%) 2464 2454 (99.6%)

platelet single-SNP 53 53 (100.0%) 1204 131 (10.9%)

3 kb 246 245 (99.6%) 1204 391 (32.5%)

20 kb 1002 900 (89.8%) 1204 963 (80.0%)

41 kb 1261 1041 (82.6%) 1204 1075 (89.3%)

81 kb 1570 1120 (71.3%) 1204 1138 (94.5%)

208 kb 1743 1057 (60.6%) 1204 1183 (98.3%)

425 kb 1653 911 (55.1%) 1204 1195 (99.3%)

sbp 3 kb 83 83 (100.0%) 568 101 (17.8%)

20 kb 191 177 (92.7%) 568 204 (35.9%)

41 kb 511 366 (71.6%) 568 380 (66.9%)

81 kb 830 496 (59.8%) 568 480 (84.5%)

208 kb 1183 561 (47.4%) 568 530 (93.3%)

425 kb 1543 538 (34.9%) 568 548 (96.5%)

SUPPLEMENTARY TABLE 8. Comparison with BOLT-LMM (continuous). KnockoffZoom discoveries for continuous phenotypes using all UK Biobank samples vs. BOLT-LMM genome-wide significant discoveries ($5 \times 10^{-8}$) using 459k European samples.

Resolution KnockoffZoom v2 KnockoffZoom v1

v2 v1 Discoveries Overlap with v1 Discoveries Overlap with v2

cvd

41 kb 42 kb 339 49 (14.5%) 51 46 (90.2%)

81 kb 88 kb 566 175 (30.9%) 182 165 (90.7%)

208 kb 226 kb 940 449 (47.8%) 514 446 (86.8%)

425 kb 226 kb 1089 453 (41.6%) 514 466 (90.7%)

diabetes

3 kb 4 kb 21 8 (38.1%) 11 8 (72.7%)

20 kb 18 kb 61 9 (14.8%) 10 9 (90.0%)

41 kb 42 kb 109 19 (17.4%) 21 19 (90.5%)

81 kb 88 kb 109 28 (25.7%) 33 28 (84.8%)

208 kb 226 kb 113 45 (39.8%) 50 46 (92.0%)

425 kb 226 kb 194 48 (24.7%) 50 48 (96.0%)

hypothyroidism

single-SNP single-SNP 19 8 (42.1%) 21 8 (38.1%)

81 kb 88 kb 277 103 (37.2%) 108 100 (92.6%)

208 kb 226 kb 295 183 (62.0%) 212 186 (87.7%)

425 kb 226 kb 335 188 (56.1%) 212 194 (91.5%)

respiratory

20 kb 18 kb 83 12 (14.5%) 13 13 (100.0%)

41 kb 42 kb 123 35 (28.5%) 41 35 (85.4%)

81 kb 88 kb 193 61 (31.6%) 65 59 (90.8%)

208 kb 226 kb 262 132 (50.4%) 176 140 (79.5%)

425 kb 226 kb 383 154 (40.2%) 176 159 (90.3%)

SUPPLEMENTARY TABLE 9. Comparison with KnockoffZoom v1 (binary). KnockoffZoom v2 discoveries using all UK Biobank British samples vs. KnockoffZoom v1 discoveries using 350k unrelated British samples; the latter are obtained using slightly different genome partitions [4]. Binary phenotypes.

Resolution KnockoffZoom v2 KnockoffZoom v1

v2 v1 Discoveries Overlap with v1 Discoveries Overlap with v2

bmi

3 kb 4 kb 10 7 (70.0%) 24 7 (29.2%)

20 kb 18 kb 343 29 (8.5%) 33 30 (90.9%)

41 kb 42 kb 918 61 (6.6%) 60 58 (96.7%)

81 kb 88 kb 1480 515 (34.8%) 555 485 (87.4%)

208 kb 226 kb 2395 1653 (69.0%) 1804 1615 (89.5%)

425 kb 226 kb 2460 1592 (64.7%) 1804 1733 (96.1%)

height

single-SNP single-SNP 95 68 (71.6%) 173 68 (39.3%)

3 kb 4 kb 570 252 (44.2%) 336 251 (74.7%)

20 kb 18 kb 1503 360 (24.0%) 388 350 (90.2%)

41 kb 42 kb 2384 832 (34.9%) 823 780 (94.8%)

81 kb 88 kb 3006 1864 (62.0%) 1976 1836 (92.9%)

208 kb 226 kb 3339 2775 (83.1%) 3284 3021 (92.0%)

425 kb 226 kb 3073 2398 (78.0%) 3284 3198 (97.4%)

platelet

single-SNP single-SNP 53 40 (75.5%) 143 40 (28.0%)

3 kb 4 kb 246 136 (55.3%) 161 138 (85.7%)

20 kb 18 kb 1002 264 (26.3%) 276 265 (96.0%)

41 kb 42 kb 1261 398 (31.6%) 408 385 (94.4%)

81 kb 88 kb 1570 856 (54.5%) 890 834 (93.7%)

208 kb 226 kb 1743 1288 (73.9%) 1460 1325 (90.8%)

425 kb 226 kb 1653 1162 (70.3%) 1460 1393 (95.4%)

sbp

41 kb 42 kb 511 86 (16.8%) 95 84 (88.4%)

81 kb 88 kb 830 265 (31.9%) 297 262 (88.2%)

208 kb 226 kb 1183 619 (52.3%) 722 612 (84.8%)

425 kb 226 kb 1543 663 (43.0%) 722 678 (93.9%)

SUPPLEMENTARY TABLE 10. Comparison with KnockoffZoom v1 (continuous). KnockoffZoom v2 discoveries using all UK Biobank British samples vs. KnockoffZoom v1 discoveries using 350k unrelated British samples. Continuous phenotypes. Other details are as in Supplementary Table 9.

Confirmed

Phenotype Resolution Discoveries Catalog Japan FinnGen Any

cvd 3 kb 22 21 (95.5%) NA 11 (50.0%) 22 (100.0%)

20 kb 239 173 (72.4%) NA 81 (33.9%) 188 (78.7%)

41 kb 339 241 (71.1%) NA 126 (37.2%) 266 (78.5%)

81 kb 566 353 (62.4%) NA 251 (44.3%) 422 (74.6%)

208 kb 940 524 (55.7%) NA 581 (61.8%) 738 (78.5%)

425 kb 1089 671 (61.6%) NA 837 (76.9%) 967 (88.8%)

diabetes 3 kb 21 20 (95.2%) 13 (61.9%) 8 (38.1%) 20 (95.2%)

20 kb 61 54 (88.5%) 26 (42.6%) 18 (29.5%) 54 (88.5%)

41 kb 109 88 (80.7%) 36 (33.0%) 30 (27.5%) 88 (80.7%)

81 kb 109 88 (80.7%) 39 (35.8%) 36 (33.0%) 89 (81.7%)

208 kb 113 95 (84.1%) 49 (43.4%) 43 (38.1%) 97 (85.8%)

425 kb 194 140 (72.2%) 58 (29.9%) 59 (30.4%) 142 (73.2%)

hypothyroidism single-SNP 19 7 (36.8%) NA 3 (15.8%) 7 (36.8%)

3 kb 40 23 (57.5%) NA 14 (35.0%) 24 (60.0%)

20 kb 105 71 (67.6%) NA 20 (19.0%) 71 (67.6%)

41 kb 222 101 (45.5%) NA 27 (12.2%) 105 (47.3%)

81 kb 277 126 (45.5%) NA 38 (13.7%) 135 (48.7%)

208 kb 295 141 (47.8%) NA 50 (16.9%) 156 (52.9%)

425 kb 335 139 (41.5%) NA 74 (22.1%) 174 (51.9%)

respiratory 20 kb 83 74 (89.2%) NA 35 (42.2%) 76 (91.6%)

41 kb 123 110 (89.4%) NA 58 (47.2%) 114 (92.7%)

81 kb 193 155 (80.3%) NA 115 (59.6%) 174 (90.2%)

208 kb 262 195 (74.4%) NA 202 (77.1%) 241 (92.0%)

425 kb 383 263 (68.7%) NA 330 (86.2%) 357 (93.2%)

SUPPLEMENTARY TABLE 11. Confirmatory comparison for binary traits. Numbers of KnockoffZoom v2 discoveries at different resolutions (all UK Biobank samples) containing associations previously reported in the GWAS Catalog, Japan Biobank resource, FinnGen resource, or any of the above.

Confirmed

Phenotype Resolution Discoveries Catalog Japan FinnGen Any

bmi 3 kb 10 10 (100.0%) 4 (40.0%) NA 10 (100.0%)

20 kb 343 307 (89.5%) 32 (9.3%) NA 308 (89.8%)

41 kb 918 655 (71.4%) 53 (5.8%) NA 656 (71.5%)

81 kb 1480 865 (58.4%) 55 (3.7%) NA 865 (58.4%)

208 kb 2395 1076 (44.9%) 64 (2.7%) NA 1076 (44.9%)

425 kb 2460 1090 (44.3%) 68 (2.8%) NA 1091 (44.3%)

height single-SNP 95 63 (66.3%) 57 (60.0%) NA 81 (85.3%)

3 kb 570 357 (62.6%) 258 (45.3%) NA 417 (73.2%)

20 kb 1503 1032 (68.7%) 483 (32.1%) NA 1102 (73.3%)

41 kb 2384 1534 (64.3%) 572 (24.0%) NA 1607 (67.4%)

81 kb 3006 1822 (60.6%) 590 (19.6%) NA 1879 (62.5%)

208 kb 3339 1856 (55.6%) 561 (16.8%) NA 1886 (56.5%)

425 kb 3073 1653 (53.8%) 494 (16.1%) NA 1669 (54.3%)

platelet single-SNP 53 37 (69.8%) 22 (41.5%) NA 41 (77.4%)

3 kb 246 153 (62.2%) 72 (29.3%) NA 168 (68.3%)

20 kb 1002 352 (35.1%) 97 (9.7%) NA 374 (37.3%)

41 kb 1261 391 (31.0%) 97 (7.7%) NA 409 (32.4%)

81 kb 1570 426 (27.1%) 91 (5.8%) NA 436 (27.8%)

208 kb 1743 445 (25.5%) 94 (5.4%) NA 453 (26.0%)

425 kb 1653 425 (25.7%) 86 (5.2%) NA 429 (26.0%)

sbp 3 kb 83 69 (83.1%) 10 (12.0%) NA 69 (83.1%)

20 kb 191 166 (86.9%) 17 (8.9%) NA 166 (86.9%)

41 kb 511 358 (70.1%) 22 (4.3%) NA 359 (70.3%)

81 kb 830 517 (62.3%) 22 (2.7%) NA 517 (62.3%)

208 kb 1183 643 (54.4%) 23 (1.9%) NA 643 (54.4%)

425 kb 1543 709 (45.9%) 23 (1.5%) NA 709 (45.9%)

SUPPLEMENTARY TABLE 12. Confirmatory comparison for continuous traits. Numbers of KnockoffZoom v2 discoveries at different resolutions (all UK Biobank samples) containing previously reported associations. Other details are as in Supplementary Table 11.


Found by BOLT-LMM Not found by BOLT-LMM

Resolution Total Catalog Japan FinnGen Any Total Catalog Japan FinnGen Any

cvd

3 kb 22 95.5% NA 50.0% 100.0% 0 NA NA NA NA

20 kb 180 83.9% NA 38.3% 88.9% 59 37.3% NA 20.3% 47.5%

41 kb 212 86.8% NA 42.9% 91.5% 127 44.9% NA 27.6% 56.7%

81 kb 261 87.4% NA 53.3% 92.7% 305 41.0% NA 36.7% 59.0%

208 kb 274 90.5% NA 71.9% 97.1% 666 41.4% NA 57.7% 70.9%

425 kb 255 94.9% NA 84.7% 99.2% 834 51.4% NA 74.5% 85.6%

diabetes

3 kb 20 95.0% 65.0% 40.0% 95.0% 1 100.0% 0.0% 0.0% 100.0%

20 kb 45 95.6% 53.3% 40.0% 95.6% 16 68.8% 12.5% 0.0% 68.8%

41 kb 54 96.3% 50.0% 46.3% 96.3% 55 65.5% 16.4% 9.1% 65.5%

81 kb 50 100.0% 54.0% 56.0% 100.0% 59 64.4% 20.3% 13.6% 66.1%

208 kb 52 98.1% 61.5% 61.5% 98.1% 61 72.1% 27.9% 18.0% 75.4%

425 kb 57 98.2% 61.4% 63.2% 98.2% 137 61.3% 16.8% 16.8% 62.8%

hypothyroidism

single-SNP 19 36.8% NA 15.8% 36.8% 0 NA NA NA NA

3 kb 40 57.5% NA 35.0% 60.0% 0 NA NA NA NA

20 kb 89 76.4% NA 22.5% 76.4% 16 18.8% NA 0.0% 18.8%

41 kb 128 64.1% NA 18.0% 65.6% 94 20.2% NA 4.3% 22.3%

81 kb 133 75.2% NA 22.6% 78.9% 144 18.1% NA 5.6% 20.8%

208 kb 129 85.3% NA 25.6% 87.6% 166 18.7% NA 10.2% 25.9%

425 kb 122 88.5% NA 32.8% 93.4% 213 14.6% NA 16.0% 28.2%

respiratory

20 kb 60 98.3% NA 48.3% 98.3% 23 65.2% NA 26.1% 73.9%

41 kb 74 100.0% NA 51.4% 100.0% 49 73.5% NA 40.8% 81.6%

81 kb 83 98.8% NA 65.1% 100.0% 110 66.4% NA 55.5% 82.7%

208 kb 82 98.8% NA 79.3% 100.0% 180 63.3% NA 76.1% 88.3%

425 kb 82 96.3% NA 92.7% 100.0% 301 61.1% NA 84.4% 91.4%

SUPPLEMENTARY TABLE 13. Comparison with other studies and LMM (binary). Numbers of KnockoffZoom v2 discoveries containing previously reported associations. The results are stratified based on whether they are also detected by BOLT-LMM (as in Supplementary Table 7). Other details are as in Supplementary Table 11.
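The stratified percentages in Supplementary Table 13 can be pooled back into the overall counts reported in Supplementary Table 16. The following is a minimal sketch of that arithmetic for three cvd rows; the function name `pooled` and the rounding to whole discoveries are illustrative assumptions, not part of the original analysis.

```python
# cvd rows copied from Supplementary Table 13:
# resolution -> (found total, found "Any" %, not-found total, not-found "Any" %)
rows = {
    "20 kb": (180, 88.9, 59, 47.5),
    "41 kb": (212, 91.5, 127, 56.7),
    "81 kb": (261, 92.7, 305, 59.0),
}

def pooled(found_total, found_pct, novel_total, novel_pct):
    """Combine the two strata into (total, confirmed, confirmed %).

    Hypothetical helper: percentages are converted back to counts by
    rounding to the nearest whole discovery.
    """
    confirmed = round(found_total * found_pct / 100) + round(novel_total * novel_pct / 100)
    total = found_total + novel_total
    return total, confirmed, round(100 * confirmed / total, 1)

for resolution, row in rows.items():
    print(resolution, pooled(*row))
```

For these rows the pooled totals and confirmed counts agree with the "Total" columns of Supplementary Table 16 (for example, 239 discoveries and 188 confirmed at 20 kb).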


Found by BOLT-LMM Not found by BOLT-LMM

Phenotype Resolution Total Catalog Japan FinnGen Any Total Catalog Japan FinnGen Any

bmi 3 kb 10 100.0% 40.0% NA 100.0% 0 NA NA NA NA

20 kb 309 94.2% 10.4% NA 94.5% 34 47.1% 0.0% NA 47.1%

41 kb 618 89.3% 8.3% NA 89.5% 300 34.3% 0.7% NA 34.3%

81 kb 792 85.2% 6.4% NA 85.2% 688 27.6% 0.6% NA 27.6%

208 kb 898 82.5% 6.2% NA 82.5% 1497 22.4% 0.5% NA 22.4%

425 kb 794 85.8% 7.6% NA 85.9% 1666 24.5% 0.5% NA 24.5%

height single-SNP 95 66.3% 60.0% NA 85.3% 0 NA NA NA NA

3 kb 570 62.6% 45.3% NA 73.2% 0 NA NA NA NA

20 kb 1469 69.8% 32.8% NA 74.5% 34 20.6% 2.9% NA 20.6%

41 kb 2167 68.7% 26.2% NA 72.0% 217 20.7% 2.3% NA 21.2%

81 kb 2417 71.0% 24.0% NA 73.1% 589 18.2% 1.5% NA 18.8%

208 kb 2228 76.1% 24.7% NA 77.3% 1111 14.5% 1.0% NA 14.8%

425 kb 1804 81.3% 26.6% NA 82.0% 1269 14.7% 1.1% NA 14.9%

platelet single-SNP 53 69.8% 41.5% NA 77.4% 0 NA NA NA NA

3 kb 245 62.4% 29.4% NA 68.6% 1 0.0% 0.0% NA 0.0%

20 kb 900 38.3% 10.8% NA 40.8% 102 6.9% 0.0% NA 6.9%

41 kb 1041 36.8% 9.3% NA 38.5% 220 3.6% 0.0% NA 3.6%

81 kb 1120 36.6% 8.0% NA 37.5% 450 3.6% 0.2% NA 3.6%

208 kb 1057 39.4% 8.5% NA 40.1% 686 4.2% 0.6% NA 4.2%

425 kb 911 42.9% 9.0% NA 43.4% 742 4.6% 0.5% NA 4.6%

sbp 3 kb 83 83.1% 12.0% NA 83.1% 0 NA NA NA NA

20 kb 177 89.3% 9.6% NA 89.3% 14 57.1% 0.0% NA 57.1%

41 kb 366 86.1% 6.0% NA 86.3% 145 29.7% 0.0% NA 29.7%

81 kb 496 86.5% 4.4% NA 86.5% 334 26.3% 0.0% NA 26.3%

208 kb 561 87.2% 4.1% NA 87.2% 622 24.8% 0.0% NA 24.8%

425 kb 538 90.0% 4.3% NA 90.0% 1005 22.4% 0.0% NA 22.4%

SUPPLEMENTARY TABLE 14. Comparison with other studies and LMM (continuous). Numbers of KnockoffZoom v2 discoveries containing previously reported associations. The results are stratified based on whether they are also detected by BOLT-LMM (as in Supplementary Table 8). Other details are as in Supplementary Table 12.


Phenotype Catalog Japan FinnGen

bmi 4261 / 4514 (94.4%) 5016 / 5094 (98.5%) NA

cvd 2223 / 4229 (52.6%) NA 2491 / 6713 (37.1%)

diabetes 709 / 1906 (37.2%) 5904 / 8550 (69.1%) 93 / 577 (16.1%)

height 4324 / 4461 (96.9%) 61730 / 63254 (97.6%) NA

hypothyroidism 176 / 197 (89.3%) NA 89 / 462 (19.3%)

platelet 1121 / 1159 (96.7%) 7797 / 8012 (97.3%) NA

respiratory 1751 / 4112 (42.6%) NA 1129 / 9450 (11.9%)

sbp 1781 / 2048 (87.0%) 1757 / 1817 (96.7%) NA

SUPPLEMENTARY TABLE 15. Estimated power based on other studies. Total numbers of reported associations in the GWAS Catalog, Japan Biobank resource, or FinnGen resource, along with the corresponding fraction confirmed in our low-resolution analysis (425 kb). Other details are as in Supplementary Tables 11–12.
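Each entry of Supplementary Table 15 is a ratio of confirmed to reported associations. As a small worked example, the sketch below recomputes the GWAS Catalog percentages for three phenotypes from the counts in the table; the helper name `fraction_confirmed` is an illustrative assumption.

```python
# (confirmed, reported) GWAS Catalog counts copied from Supplementary Table 15
catalog = {
    "bmi": (4261, 4514),
    "cvd": (2223, 4229),
    "height": (4324, 4461),
}

def fraction_confirmed(confirmed, reported):
    """Percentage of reported associations that are confirmed (one decimal)."""
    return round(100 * confirmed / reported, 1)

for phenotype, (confirmed, reported) in catalog.items():
    print(f"{phenotype}: {confirmed} / {reported} ({fraction_confirmed(confirmed, reported)}%)")
```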


Total Not found by BOLT-LMM

Confirmed Confirmed

Resolution Discoveries Other Other or Enrich. Discoveries Other Other or Enrich.

cvd

3 kb 22 22 (100.0%) 22 (100.0%) 0 NA NA

20 kb 239 188 (78.7%) 219 (91.6%) 59 28 (47.5%) 50 (84.7%)

41 kb 339 266 (78.5%) 309 (91.2%) 127 72 (56.7%) 107 (84.3%)

81 kb 566 422 (74.6%) 495 (87.5%) 305 180 (59.0%) 240 (78.7%)

208 kb 940 738 (78.5%) 764 (81.3%) 666 472 (70.9%) 493 (74.0%)

425 kb 1089 967 (88.8%) 968 (88.9%) 834 714 (85.6%) 715 (85.7%)

diabetes

3 kb 21 20 (95.2%) 20 (95.2%) 1 1 (100.0%) NA

20 kb 61 54 (88.5%) 57 (93.4%) 16 11 (68.8%) 13 (81.2%)

41 kb 109 88 (80.7%) 97 (89.0%) 55 36 (65.5%) 42 (76.4%)

81 kb 109 89 (81.7%) 99 (90.8%) 59 39 (66.1%) 48 (81.4%)

208 kb 113 97 (85.8%) 106 (93.8%) 61 46 (75.4%) 54 (88.5%)

425 kb 194 142 (73.2%) 157 (80.9%) 137 86 (62.8%) 100 (73.0%)

hypothyroidism

single-SNP 19 7 (36.8%) 7 (36.8%) 0 NA NA

3 kb 40 24 (60.0%) 24 (60.0%) 0 NA NA

20 kb 105 71 (67.6%) 91 (86.7%) 16 3 (18.8%) 8 (50.0%)

41 kb 222 105 (47.3%) 172 (77.5%) 94 21 (22.3%) 61 (64.9%)

81 kb 277 135 (48.7%) 219 (79.1%) 144 30 (20.8%) 93 (64.6%)

208 kb 295 156 (52.9%) 226 (76.6%) 166 43 (25.9%) 101 (60.8%)

425 kb 335 174 (51.9%) 231 (69.0%) 213 60 (28.2%) 116 (54.5%)

SUPPLEMENTARY TABLE 16. Enrichment analysis with independent GWAS (binary). Numbers of KnockoffZoom v2 discoveries confirmed by other studies or enrichment analysis using independent GWAS summary statistics. Enrichment results are estimates. The results are stratified based on whether they are also detected by BOLT-LMM (as in Supplementary Table 13).


Total Not found by BOLT-LMM

Confirmed Confirmed

Phenotype Resolution Discover. Other Other or Enrich. Discover. Other Other or Enrich.

bmi 3 kb 10 10 (100.0%) 10 (100.0%) 0 NA NA

20 kb 343 308 (89.8%) 328 (95.6%) 34 16 (47.1%) 29 (85.3%)

41 kb 918 656 (71.5%) 821 (89.4%) 300 103 (34.3%) 234 (78.0%)

81 kb 1480 865 (58.4%) 1182 (79.9%) 688 190 (27.6%) 450 (65.4%)

208 kb 2395 1076 (44.9%) 1620 (67.6%) 1497 335 (22.4%) 806 (53.8%)

425 kb 2460 1091 (44.3%) 1567 (63.7%) 1666 409 (24.5%) 820 (49.2%)

height single-SNP 95 81 (85.3%) 81 (85.3%) 0 NA NA

3 kb 570 417 (73.2%) 417 (73.2%) 0 NA NA

20 kb 1503 1102 (73.3%) 1351 (89.9%) 34 7 (20.6%) 20 (58.8%)

41 kb 2384 1607 (67.4%) 1997 (83.8%) 217 46 (21.2%) 111 (51.2%)

81 kb 3006 1879 (62.5%) 2386 (79.4%) 589 111 (18.8%) 314 (53.3%)

208 kb 3339 1886 (56.5%) 2493 (74.7%) 1111 164 (14.8%) 556 (50.0%)

425 kb 3073 1669 (54.3%) 2231 (72.6%) 1269 189 (14.9%) 622 (49.0%)

platelet single-SNP 53 41 (77.4%) 41 (77.4%) 0 NA NA

3 kb 246 168 (68.3%) 230 (93.5%) 1 0 (0.0%) 0 (0.0%)

20 kb 1002 374 (37.3%) 778 (77.6%) 102 7 (6.9%) 49 (48.0%)

41 kb 1261 409 (32.4%) 934 (74.1%) 220 8 (3.6%) 127 (57.7%)

81 kb 1570 436 (27.8%) 1058 (67.4%) 450 16 (3.6%) 226 (50.2%)

208 kb 1743 453 (26.0%) 1017 (58.3%) 686 29 (4.2%) 256 (37.3%)

425 kb 1653 429 (26.0%) 922 (55.8%) 742 34 (4.6%) 297 (40.0%)

sbp 3 kb 83 69 (83.1%) 69 (83.1%) 0 NA NA

20 kb 191 166 (86.9%) 178 (93.2%) 14 8 (57.1%) 12 (85.7%)

41 kb 511 359 (70.3%) 441 (86.3%) 145 43 (29.7%) 97 (66.9%)

81 kb 830 517 (62.3%) 663 (79.9%) 334 88 (26.3%) 200 (59.9%)

208 kb 1183 643 (54.4%) 885 (74.8%) 622 154 (24.8%) 358 (57.6%)

425 kb 1543 709 (45.9%) 983 (63.7%) 1005 225 (22.4%) 474 (47.2%)

SUPPLEMENTARY TABLE 17. Enrichment analysis with independent GWAS (continuous). Numbers of KnockoffZoom v2 discoveries confirmed by other studies or enrichment analysis using independent GWAS summary statistics. Other details are as in Supplementary Table 16.


Total Not found by BOLT-LMM

Resolution Input Confirmed Input Confirmed

cvd

20 kb 51 23–40 (45%–78%) 31 15–27 (48%–87%)

41 kb 73 33–53 (45%–73%) 55 27–43 (49%–78%)

81 kb 144 57–88 (40%–61%) 125 45–74 (36%–59%)

208 kb 202 8–47 (4%–23%) 194 5–40 (3%–21%)

425 kb 122 0–7 (0%–6%) 120 0–5 (0%–4%)

diabetes

3 kb 1 1–1 (100%–100%) 0 NA

20 kb 7 0–5 (0%–71%) 5 0–5 (0%–100%)

41 kb 21 3–14 (14%–67%) 18 1–11 (6%–61%)

81 kb 20 6–15 (30%–75%) 19 5–14 (26%–74%)

208 kb 16 4–14 (25%–88%) 15 3–12 (20%–80%)

425 kb 52 6–27 (12%–52%) 51 4–25 (8%–49%)

hypothyroidism

single-SNP 12 12–12 (100%–100%) 0 NA

3 kb 16 11–16 (69%–100%) 0 NA

20 kb 34 13–26 (38%–76%) 13 2–9 (15%–69%)

41 kb 117 53–81 (45%–69%) 73 30–51 (41%–70%)

81 kb 142 69–98 (49%–69%) 114 49–76 (43%–67%)

208 kb 139 55–86 (40%–62%) 123 43–73 (35%–59%)

425 kb 161 41–74 (25%–46%) 153 40–73 (26%–48%)

SUPPLEMENTARY TABLE 18. Details of enrichment analysis (binary). Bootstrap confidence intervals (90%) for the proportion of novel KnockoffZoom v2 discoveries confirmed by the enrichment analysis in Supplementary Table 16.
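This section reports the bootstrap intervals without spelling out the resampling scheme. Purely as an illustration of the general idea, the sketch below computes a 90% percentile-bootstrap interval for a confirmed count. It assumes a hypothetical 0/1 confirmation indicator per input discovery, which the enrichment analysis (whose results are estimates) may not actually provide; the function `bootstrap_ci`, its parameters, and the example data are all assumptions and may differ from the procedure used for the table.

```python
import random

def bootstrap_ci(indicators, n_boot=2000, level=0.90, seed=0):
    """Percentile-bootstrap interval for the number of confirmed discoveries.

    indicators: list of 0/1 values, one per input discovery (hypothetical data).
    Returns (lower count, upper count) at the requested confidence level.
    """
    rng = random.Random(seed)
    n = len(indicators)
    # Resample the discoveries with replacement and record the confirmed count.
    counts = sorted(sum(rng.choices(indicators, k=n)) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lower = counts[int(alpha * n_boot)]
    upper = counts[int((1 - alpha) * n_boot) - 1]
    return lower, upper

# Hypothetical example: 51 input discoveries, 30 of them flagged as confirmed.
example = [1] * 30 + [0] * 21
lower, upper = bootstrap_ci(example)
print(f"{lower}-{upper} ({100 * lower / 51:.0f}%-{100 * upper / 51:.0f}%)")
```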


Total Not found by BOLT-LMM

Resolution Input Confirmed Input Confirmed

bmi

20 kb 35 13–27 (37%–77%) 18 8–18 (44%–100%)

41 kb 262 146–184 (56%–70%) 197 115–147 (58%–75%)

81 kb 615 284–350 (46%–57%) 498 231–289 (46%–58%)

208 kb 1319 494–595 (37%–45%) 1162 422–518 (36%–45%)

425 kb 1369 424–529 (31%–39%) 1257 361–461 (29%–37%)

height

single-SNP 14 0–9 (0%–64%) 0 NA

3 kb 153 90–123 (59%–80%) 0 NA

20 kb 401 225–272 (56%–68%) 27 7–19 (26%–70%)

41 kb 777 353–426 (45%–55%) 171 47–84 (27%–49%)

81 kb 1127 460–552 (41%–49%) 478 174–234 (36%–49%)

208 kb 1453 555–660 (38%–45%) 947 349–434 (37%–46%)

425 kb 1404 509–615 (36%–44%) 1080 387–478 (36%–44%)

platelet

single-SNP 12 3–12 (25%–100%) 0 NA

3 kb 78 53–70 (68%–90%) 1 0–0 (0%–0%)

20 kb 628 373–433 (59%–69%) 95 29–55 (31%–58%)

41 kb 852 488–561 (57%–66%) 212 100–138 (47%–65%)

81 kb 1134 578–665 (51%–59%) 434 181–238 (42%–55%)

208 kb 1290 514–614 (40%–48%) 657 190–264 (29%–40%)

425 kb 1224 442–542 (36%–44%) 708 224–301 (32%–43%)

sbp

3 kb 14 3–12 (21%–86%) 0 NA

20 kb 25 5–18 (20%–72%) 6 2–6 (33%–100%)

41 kb 152 67–97 (44%–64%) 102 40–67 (39%–66%)

81 kb 313 122–169 (39%–54%) 246 90–133 (37%–54%)

208 kb 540 209–273 (39%–51%) 468 173–233 (37%–50%)

425 kb 834 232–316 (28%–38%) 780 209–289 (27%–37%)

SUPPLEMENTARY TABLE 19. Details of enrichment analysis (continuous). Bootstrap confidence intervals (90%) for the proportion of novel KnockoffZoom v2 discoveries confirmed by the enrichment analysis in Supplementary Table 17.


Phenotype Discoveries Contains gene Known lead SNP consequence Known lead SNP association

cvd 31 26 (84%) 28 (90%) 21 (68%)

diabetes 5 5 (100%) 3 (60%) 5 (100%)

hypothyroidism 13 12 (92%) 8 (62%) 9 (69%)

respiratory 6 5 (83%) 4 (67%) 3 (50%)

SUPPLEMENTARY TABLE 20. Functional follow-up on novel unconfirmed discoveries. Numbers of novel discoveries (not found by BOLT-LMM and not confirmed by the other studies in Supplementary Table 13) that either contain a gene or whose lead SNP has a known functional annotation or a known association with phenotypes closely related to that of interest.

Phenotype Associations

cvd NA (10), blood pressure (9), BMI (8), obesity (3), cardiovascular disease (1), CCL2 (1), cholesterol (1), triglycerides (1), heart rate (1)

diabetes diabetes (3), Factor VII (1), glyburide metabolism (1)

hypothyroidism NA (4), autoimmune thyroid disease (2), psoriasis (2), diabetic nephropathy (1), Graves disease (1), hypothyroidism (1), rheumatoid arthritis (1), thyroid function (1)

respiratory NA (3), hypersomnia (1), interaction with air pollution (1), serum IgE (1)

SUPPLEMENTARY TABLE 21. Known associations of novel discoveries to related phenotypes. Associations of our novel discoveries (20 kb resolution) in Supplementary Table 20 to related traits. The same discovery may have more than one relevant association in this table.


Consequence cvd diabetes hypothyroidism respiratory

2KB Upstream 1

3 Prime UTR 2

500B Downstream 1

Intron 19 2 6 3

Missense 3 1

Non coding transcript exon 1

Regulatory region 2

Stop gained 1

Tf binding site 1

Unknown 3 2 5 2

Total 31 5 13 6

SUPPLEMENTARY TABLE 22. Known consequences of lead variants. Numbers of lead variants with known consequences for our novel discoveries (20 kb resolution) in Supplementary Table 20.
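Supplementary Tables 20 and 22 can be cross-checked against each other: for each phenotype, the number of lead SNPs with a known consequence in Supplementary Table 20 should equal the "Total" row minus the "Unknown" row of Supplementary Table 22. The short sketch below performs that check with the counts copied from the two tables; the variable names are illustrative.

```python
# "Total" and "Unknown" rows of Supplementary Table 22
totals = {"cvd": 31, "diabetes": 5, "hypothyroidism": 13, "respiratory": 6}
unknown = {"cvd": 3, "diabetes": 2, "hypothyroidism": 5, "respiratory": 2}

# "Known lead SNP consequence" column of Supplementary Table 20
known_table20 = {"cvd": 28, "diabetes": 3, "hypothyroidism": 8, "respiratory": 4}

# Lead variants with a known consequence, derived from Table 22 alone
known_table22 = {phenotype: totals[phenotype] - unknown[phenotype] for phenotype in totals}

for phenotype in totals:
    share = round(100 * known_table22[phenotype] / totals[phenotype])
    print(f"{phenotype}: {known_table22[phenotype]} ({share}%)")
```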


1 Yedidia, J., Freeman, W. & Weiss, Y. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, vol. 8, 239–269 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003).

2 Wainwright, M. J. & Jordan, M. I. Graphical models, exponential families, and variational inference (Now Publishers Inc, 2008).

3 Sesia, M., Sabatti, C. & Candes, E. Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).

4 Sesia, M., Katsevich, E., Bates, S., Candes, E. & Sabatti, C. Multi-resolution localization of causal variants across the genome. Nat. Commun. 11, 1093 (2020).

5 Bates, S., Candes, E., Janson, L. & Wang, W. Metropolized knockoff sampling. J. Am. Stat. Assoc. 1–25 (2020).

6 Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).

7 Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).

8 Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006).

9 Candes, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. J. R. Stat. Soc. B. 80, 551–577 (2018).

10 Klasen, J. R. et al. A multi-marker association method for genome-wide association studies without the need for population structure correction. Nat. Commun. 7 (2016).

11 Japan, B. Biobank Japan Project (2020). URL http://jenger.riken.jp/en/.

12 FinnGen. FinnGen documentation of r3 release (2020). URL https://finngen.gitbook.io/documentation/.

13 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100, 9440–9445 (2003).

14 Klaus, B., Strimmer, K. & Strimmer, M. K. Package 'fdrtool'.

15 Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).