nature genetics: doi:10.1038/ng · by a gcta-cojo analysis6, we selected the top gwas snp,...

20
Supplementary Figure 1 QQ plots of P values from the SMR tests under a range of simulation scenarios. The simulation strategy is described in the Supplementary Note. Shown are the results from 10,000 simulations. MR one sample, Mendelian randomization analysis in one sample; SMR one sample, summary data–based Mendelian randomization analysis in one sample; SMR two samples, the effect sizes of a SNP on gene expression (βzx) and a SNP on phenotype (bzy) are estimated from two independent samples. Nature Genetics: doi:10.1038/ng.3538

Upload: others

Post on 13-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 1

QQ plots of P values from the SMR tests under a range of simulation scenarios.

The simulation strategy is described in the Supplementary Note. Shown are the results from 10,000 simulations. MR one sample,Mendelian randomization analysis in one sample; SMR one sample, summary data–based Mendelian randomization analysis in one sample; SMR two samples, the effect sizes of a SNP on gene expression (βzx) and a SNP on phenotype (bzy) are estimated from two independent samples.

Nature Genetics: doi:10.1038/ng.3538

Page 2: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 2

Prioritizing genes at the NMRAL1 locus for schizophrenia.

(a) Top, P values from the GWAS for schizophrenia1 (brown dots) and P values from SMR tests (diamonds) using Westra eQTL data2.Bottom, P values from the Westra and Ramasamy eQTL studies3 for NMRALI1. Shown in a are all the SNPs available in the GWAS and eQTL data. (b,c) Effect sizes of the SNPs (used for the HEIDI test) from the GWAS against those from the Westra and Ramasamy eQTL studies. The orange dashed lines represent the estimate of bxy at the top cis-eQTL (rather than the regression line). Error bars are the standard errors of SNP effects.

Nature Genetics: doi:10.1038/ng.3538

Page 3: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 3

Genes with multiple tagging probes that pass the SMR and HEIDI tests—the ANKRD55 locus for rheumatoid arthritis.

(a) Top, dots in brown represent P values for SNPs from the latest GWAS for rheumatoid arthritis4, diamonds represent P values for probes from the SMR analysis and triangles represent probes without a cis-eQTL at PeQTL < 5.0 × 10–8. Probes that passed the HEIDI test (PHEIDI ≥ 0.05) are highlighted in red. Bottom, P values from eQTL studies for the two probes that both tag ANKRD55. Shown in aare all SNPs available in the GWAS and eQTL summary data (rather than those in common). (b,c) Effect sizes of the SNPs (used inthe HEIDI test) from the GWAS against those from the eQTL study for the two probes. The orange dashed lines represent the estimate of bxy at the top cis-eQTL. Error bars are the standard errors of SNP effects. (d) Plot of expression levels of the two probes against each other in the data (328 unrelated individuals) from the Brisbane Systems Genetics Study5.

Nature Genetics: doi:10.1038/ng.3538

Page 4: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 4

Power of detecting heterogeneity in bxy as a function of LD between two causal variants in simulations.

We simulated data under the alternative hypothesis that there are two distinct causal variants, one affecting gene expression and one affecting the trait. The method for simulation is described in the Supplementary Note.

Nature Genetics: doi:10.1038/ng.3538

Page 5: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 5

Distribution of LD between the top associated GWAS SNPs and the simulated causal eQTLs.

The results are from an SMR analysis of real GWAS data for five complex traits (Supplementary Table 6) and simulated eQTL data. The eQTL data were simulated on the basis of real genotypes, mimicking the Westra study but with the causal variants of cis-eQTLs randomly placed across the genome (see the Supplementary Note for details). There were a number of simulated probes that passedthe SMR and HEIDI tests (Supplementary Table 14). For these probes, the overlaps between the GWAS and eQTL signals are due to the strong LD between the GWAS causal variants and the causal eQTLs. Unfortunately, the GWAS causal variants are unknown.Shown are LD r2 values for the top associated GWAS SNPs and the simulated causal eQTLs in these probe regions, which are likely tobe underestimations of LD between the causal variants. For regions where there were multiple independent GWAS signals as indicatedby a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signalwas defined as the top associated GWAS SNP in the region when conditioning on the GWAS SNP used in the SMR test).

Nature Genetics: doi:10.1038/ng.3538

Page 6: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 6

Estimating SNP effects from z statistics.

The reported GWAS effect sizes are equivalent to the SNP effects reported in the GWAS summary data. The estimated GWAS effect sizes are equivalent to the SNP effects estimated from z statistics using the sample size of the whole study rather than the per-SNP sample size reported in the summary data (Supplementary Note). This is just a demonstration of the difference between the reported estimate of effect size and the effect size estimated from z statistics without taking the per-SNP sample size into account. In our SMR analyses for height, we used the reported estimate of effect size.

Nature Genetics: doi:10.1038/ng.3538

Page 7: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

Supplementary Figure 7

An example of pleiotropic signal being diluted by multiple associated SNPs in GWAS.

(a) Effect sizes of the SNPs (used in the HEIDI test) from the latest GWAS for height7 against those from the Westra eQTL study2 at the SPI1 locus. We detected a secondary signal in the GWAS data. (b) Effect sizes conditioning on the secondary signal. The orangedashed lines represent the estimate of bxy at the top cis-eQTL. Error bars are the standard errors of SNP effects.

Nature Genetics: doi:10.1038/ng.3538

Page 8: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

8

Supplementary Note

1. Simulation strategy

We conducted a series of simulations to compare SMR with MR under two assumptions, causality

and pleiotropy.

Direct causality model: We simulated the genotype of a SNP (z) from a binomial distribution, w ~

B(2, p), with p being the allele frequency of the SNP, p ~ U(0.05, 0.5). The SNP genotypes were

then standardized as z = (w− 2 p) / 2 p(1− p) . We simulated the gene expression based on the

model x = zbzx + c+ ezx , where c is a latent confounding variable with c ~ N(0,σ c2 ) ;

σ c2 = var(zbzx )Rc

2 Rzx2 ; Rc

2 is the proportion of variance in x explained by c, Rc2 ~U (0,0.5) ; Rzx

2 is

the proportion of variance in x explained by z, Rzx2 = 0.05, 0.1 or 0.2;

ezx ~ N (0,var(wbzx )(1 / Rzx2 −1)−σ c

2 ) . We then standardized x with mean 0 and variance 1. We

simulated the trait based on the model y = xbxy + c+ exy , where exy ~ N[0,var(Xbxy )(1 / Rxy2 −1)−σ c

2 ] ;

Rxy2 is proportion of variance in y explained by x, Rxy

2 = 0.0, 0.01 or 0.05. We also standardized y

with mean 0 and variance 1. It is notable that the magnitude of bzx or bxy is not important because

they are scaled according to a fixed proportion of variance explained. We simulated 1,000

individuals for dataset 1, and simulated another 10,000 individuals for dataset 2. We repeated the

simulation for 10,000 times.

Direct pleiotropy model: We simulated the data using the same setting as above but generating the

trait phenotype based on the model y = zbzy + c+ ezy where ezy ~ N 0,var(zbzy ) / (1 / Rzy2 −1)−σ c

2"#

$%

and Rzy2 is proportion of variance in y explained by z with rzy

2 = 0.0, 0.005 or 0.01.

Indirect causality model: We simulated the genotype of a SNP using the same method as described

above. The latent variable c was simulated based on the model, c = zbzc + ezc where ezc is generated

from a normal distribution ezc ~ N (0,σ zc2 ) , σ zc

2 = var(zbzc )(1 / Rzc2 −1) , and Rzc

2 is the proportion of

Nature Genetics: doi:10.1038/ng.3538

Page 9: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

9

variance in c explained by z ( Rzc2 = 0.6 ). We standardized c and generated x based on the model

x = cbcx + ecx , where ecx ~ N (0,var(cbcx )(1 / Rcx2 −1)) and Rcx

2 is the proportion of variance in x

explained by c ( Rcx2 = 0.05, 0.1 and 0.2). We then standardized z and generated y based on the model

y = xbxy + exy where exy ~ N (0,var(xbxy )(1 / Rxy2 −1)) and Rxy

2 =0, 0.01 and 0.05. We standardized y

with mean 0 and variance 1. We simulated 1,000 individuals for dataset 1, and 10,000 individuals

for dataset 2. We repeated the simulation for 10,000 times.

Indirect pleiotropy model: We simulated z and x using the same method as in the indirect causality

model but generating the trait based on the model y = cbcy + ecy , where

ecy ~ N (0,var(cbcy )(1 / Rcy2 −1)) and Rcy

2 is the proportion of variance in x explained by c ( Rcy2 = 0,

0.005 and 0.01).

In each simulation replicate, we estimated bzx and bzy by a linear regression analysis, and

performed the MR and SMR analyses following the methods described in the main text (Online

Methods). To investigate whether bxy estimated from the SMR method is consistent with the effect

on x of y due to genetic factor, we estimated b(y~gx) by a linear regression analysis of y on the

genetic value of x (gx).

2. Estimating SNP effect from z-statistic

In an association analysis, if we assume the phenotype (y) has been standardized to mean 0 and

variance 1, the standard error (SE) of the estimate of SNP effect can be calculated as

SE = σ e2

nvar(w)=1− 2p(1− p)b2

2p(1− p)n

where σ e2 is the residual variance of regressing y on w, w is the genotype variable of a SNP (coded

as 0, 1 or 2), b is the SNP effect on trait, p is the minor allele frequency (MAF) of the SNP, and n is

Nature Genetics: doi:10.1038/ng.3538

Page 10: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

10

the sample size. We know that the z-statistic can be calculated as z = b̂ / SE . We therefore have

b̂ = z / 2p(1− p)(n+ z2 ) and SE =1 / 2 p(1− p)(n+ z2 ) .

3. Sampling covariance between b̂xy estimated at two SNPs in LD

The sampling covariance between the bxy estimates at two correlated SNPs (SNP i and j) can be

written as

cov(b̂xy (i ) ,b̂xy ( j ) ) = E(b̂xy (i )b̂xy ( j ) )− E(b̂xy (i ) )E(b̂xy ( j ) )

= E(b̂zy (i )b̂zy ( j )β̂zx(i )β̂zx( j )

)− E(b̂zy (i )β̂zx(i )

)E(b̂zy ( j )β̂zx( j )

)

Using the Delta method8, we get

E(b̂zy (i )b̂zy ( j )β̂zx(i )β̂zx( j )

) ≈bzy (i )bzy ( j )βzx(i )βzx( j )

+bzy (i )bzy ( j )β 3zx(i )βzx( j )

var(β̂zx(i ) )+bzy (i )bzy ( j )βzx(i )β

3zx( j )

var(β̂zx( j ) )

+cov(b̂zy (i ) ,b̂zy ( j ) )

βzx(i )βzx( j )+bzy (i )bzy ( j )β 2zx(i )β

2zx( j )

cov(β̂zx(i ) , β̂zx( j ) )

E(b̂zyβ̂zx) =bzyβzx

1+ var(β̂zx )β 2zx

−cov(b̂zy , β̂zx )

bzyβzx

"

#$$

%

&''=bzyβzx

1+ 1χ zx2−cov(b̂zy , β̂zx )bzyβzx

"

#$$

%

&'' , where χ zx

2 is the chi-

squared statistic to test for the effect of z on x. Under the null, cov(b̂zy , β̂zx ) = 0 in particular if they

are estimated from two independent studies. Since z is selected to be strongly associated with x,

1+ 1χ zx2≈1 . We get E(

b̂zyβ̂zx) ≈bzyβzx

and E(b̂zy (i )β̂zx(i )

)E(b̂zy ( j )β̂zx( j )

) ≈bzy (i )bzy ( j )βzx(i )βzx( j )

.

We therefore have

cov(b̂xy (i ) ,b̂xy ( j ) ) ≈cov(b̂zy (i ) ,b̂ zy ( j ) )

βzx(i )βzx( j )+bzy (i )b zy ( j )β 2zx(i )β

2zx( j )

cov(β̂zx(i ) , β̂zx( j ) )

−bzy (i )b zy ( j )β 3zx(i )β

3zx( j )

var(β̂zx(i ) ) var(β̂zx( j ) )

Nature Genetics: doi:10.1038/ng.3538

Page 11: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

11

We know that the sampling correlation between the estimates of SNP effects equals to the LD

correlation between the SNPs. We then get

cov(b̂xy (i ) ,b̂xy ( j ) ) ≈r

βzx(i )βzx( j )var(b̂zy (i ) ) var(b̂ zy ( j ) ) +bxy (i )bxy ( j )

rzzx(i )zzx( j )

−1

z2zx(i )z2zx( j )

#

$%%

&

'((

where zzx = β̂zx / var(β̂zx ) .

4. Test of heterogeneity in bxy estimated at two linked SNPs for two probes

Following the derivation above (Supplementary Note #3), the sampling covariance between the

estimates of bxy for two probes (x1 and x2) can be expressed as

cov(b̂xy (1i ) ,b̂xy (2 j ) ) ≈cov(b̂zy (1i ) ,b̂ zy (2 j ) )

βzx(1i )βzx(2 j )+bzy (1i )b zy (2 j )β 2zx(1i )β

2zx(2 j )

cov(β̂zx(1i ) , β̂zx(2 j ) )

−bzy (1i )b zy (2 j )β 3zx(1i )β

3zx(2 j )

var(β̂zx(1i ) ) var(β̂zx(2 j ) )

where the subscripts i and j represent two SNPs, and cov(β̂zx(1i ) , β̂zx(2 j ) ) is the sampling covariance

between the estimates of effect sizes of two eQTLs for two probes (see Supplementary Note #3 for

the definitions of the other parameters).

The least squares (LS) estimates of two eQTL effects can be expressed as

β̂zx(1i ) = !zix1 !zizi( ) and β̂zx(2 j ) = !z jx2 !z jz j( ) , with sampling variance var(β̂zx(1i ) ) = var(x1) !zizi( )

and var(β̂zx(2 j ) ) = var(x2 ) !z jz j( ) , where z is a vector of standardized SNP genotypes and x is a

vector of standardized gene expression levels. We therefore have the sampling covariance between

two LS estimates as

cov(β̂zx(1i ) , β̂zx(2 j ) ) = cov !zix1, !z jx2( ) !zizi !z jz j( ) = !ziz j cov x1,x2( ) !zizi !z jz j( )

= rijrx var(β̂zx(1i ) ) var(β̂zx(2 j ) )

where rij is the LD correlation between SNPs, and rx is the phenotypic correlation between probes.

Nature Genetics: doi:10.1038/ng.3538

Page 12: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

12

We therefore can have a heterogeneity test-statistic as (β̂zx(1i ) − β̂zx(2 j ) )2 / var(β̂zx(1i ) − β̂zx(2 j ) )

which follows χ12 under the null hypothesis that there is no difference, where

var(β̂zx(1i ) − β̂zx(2 j ) ) = var(β̂zx(1i ) )+ var(β̂zx(2 j ) )− 2cov(β̂zx(1i ) , β̂zx(2 j ) ) .

5. Mean distance of probes and cis-eQTLs to the transcription end sites

We identified 9 genes for which there was a significant difference in bxy between the tagging probes

(Supplementary Table 10). We then tested whether the secondary probes of these genes (9 probes

that did not passed the SMR or HEIDI test) tend to be located near the transcription end sites (TES).

We calculated mean distance of the probes (as well as their top associated cis-eQTLs) to TES,

which was then compared with what one would expect if probes (or cis-eQTLs) were sampled at

random. The probes were sampled from all genes with matched gene lengths (genes were classified

into bins by length with 2Kb intervals), and the cis-eQTLs were sampled from all the cis-eQTLs

with matched MAFs (cis-eQTLs were binned by MAF with 10 intervals). The whole process was

repeated 10,000 times.

6. A summary description of the methods for detecting gene-trait association using GWAS

and eQTL data

There are other methods for detecting gene-trait associations using GWAS and gene expression

data. For example, prior to the GWAS era, Schadt and colleagues9,10 conceptualized an integrative

analysis of genetic and eQTL data for fine mapping and identification of drug targets. Nicolae et

al.11 showed that trait-associated variants identified from GWAS were enriched for eQTLs and

proposed to use this information to identify novel genetic risk factors for diseases. COLOC12-14 is a

Bayesian analysis approach that uses GWAS and eQTL P values to assess whether two association

signals (one from GWAS and one from eQTL study) are consistent with a single shared causal

variant. This is a very useful tool to detect colocalization of GWAS and eQTL signals at known

Nature Genetics: doi:10.1038/ng.3538

Page 13: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

13

GWAS loci, and to distinguish between models (for example, one causal variant vs. two causal

variants). COLOC does not provide thresholds for the posterior probabilities to control for genome-

wide false positive rate (there were multiple thresholds used in Giambartolomei, et al.13), which

makes it challenging to apply the method for genome-wide scan. PrediXcan15 links GWAS and

eQTL data by a polygenic prediction analysis. It requires individual-level genotype and gene

expression data in the training set, and individual-level genotype and phenotype data in the target

set, which limits the power because of the limited sample sizes of such data sets at present. Gusev et

al.16 proposed an elegant method to overcome this problem by performing a polygenic prediction

analysis using summary-level data. Both PrediXcan and the Gusev et al. method do not distinguish

between pleiotropy and linkage.

7. Quantifying the power of detecting heterogeneity in bxy as a function of LD between two

causal variants by simulations

We used simulations to calibrate the extent to which the power of detecting heterogeneity in bXY

decreases with LD between two causal variants. We performed simulations based on the HM2-

imputed ARIC data set. We used 182,567 SNPs on chromosome 1 with minor allele frequency

(MAF) ≥ 0.01 and Hardy-Weinberg Equilibrium (HWE) P value ≥ 1×10–6. In each simulation, we

randomly generated a physical position, and included the SNPs within ±250Kb of the position as

“cis-eQTLs”. We then randomly sampled from the cis-eQTLs two SNPs as causal variants (z1 and

z2), one for gene expression and one for trait. We generated gene expression value based on the

model x = z1bzx + c+ ezx , and trait value based on the model y = z2bzy + c+ ezy , where Rzx2

= 0.2 and

Rzy2 = 0.01 (see above Supplementary Note #1 for the definitions of the parameters and variables).

We sampled the two causal variants with LD from low to high in ten categories, and repeated the

simulation 1,000 times. We analyzed the simulated data (excluding the causal variants) using the

SMR method with the same criteria as used in real data analysis, for example, including only the

Nature Genetics: doi:10.1038/ng.3538

Page 14: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

14

genes with a cis-eQTL at PeQTL < 5×10–8 for SMR test, and performing HEIDI test only for genes

with PSMR < 8.4×10–6. The simulation results are shown in Supplementary Fig. 4.

8. SMR analysis with simulated eQTL data

We performed an analysis to quantify the enrichment of the observed number of probes that passed

the SMR and HEID tests compared with what one would expect if the eQTLs were randomly

placed. We could not simply shuffle the positions of either the GWAS SNPs or the eQTLs because

of LD. We therefore used a simulation-based approach. We simulated eQTL data to mimic the

Westra eQTL study using real genotype data (6,654 individuals and 2,413,012 SNPs with MAF ≥

0.01 from the HapMap 2-imputed ARIC study). We randomly generated a physical position apart

from MHC region as the midpoint of a probe, and defined a ±250Kb window centered on the

position as a cis-eQTL region. We then randomly sampled a SNP from the cis-eQTL region as

causal variant, and generated gene expression level (x) by a simple additive model x = zbzx + c+ ezx

(see Supplementary Note #1 for the definitions of the parameters and variables). To achieve

similar power as in the Westra study, we assigned Rzx2 = χ 2 −1( ) χ 2 −1+ n( ) , where n is the sample

size (n = 6,554), and χ 2 is the chi-squared statistic of the top cis-eQTL for a probe in the Westra

data (the χ 2 values of the top cis-eQTLs included in the SMR analysis were > 29.7, equivalent to

p-values < 5×10–8). We estimated the eQTL effect sizes and standard errors by a linear regression

analysis of the simulated gene expression level on SNPs. We repeated this process and generated

eQTL data for 5,967 probes as used in real data analysis. We then preformed the SMR analyses

using the simulated eQTL data and real-life GWAS summary data for height, BMI, WHRadjBMI,

RA and SCZ (excluding the MHC region as in real data analysis). The whole simulation process

was repeated 100 times.

9. Four reasons why the HEIDI test is over-conservative

Nature Genetics: doi:10.1038/ng.3538

Page 15: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

15

There are a least four reasons for the HEIDI test being too conservative. First, we used a

significance level of 0.05 without correcting for multiple testing. This means that under the null

hypothesis of no heterogeneity (that is, all the associations between trait and gene expression

detected by the initial SMR test are due to pleiotropy), we will eliminate 5% of the real findings.

Secondly, we used summary data in the public domain that inevitably contain some errors, for

example, the effect alleles may be mislabeled, which will be detected as heterogeneity. Thirdly, we

estimated eQTL effect sizes from z-statistics without taking into account the difference in per-SNP

sample size (per-SNP sample size was not available in the Westra eQTL data), which generated

additional variation in the estimates of eQTL effect sizes. Shown in Supplementary Fig. 6 is a

demonstration of the difference between the reported effect size, and the effect size estimated from

z-statistic without taking per-SNP sample size into account using the latest GIANT summary data

for height. Last but not least, multiple association signals at single loci in either GWAS or eQTL

studies will be detected as heterogeneity (see Supplementary Fig. 7 for an example).

10. The interpretations of bxy under different assumptions

The estimate of bxy from the SMR analysis has a direction (that is, positive or negative sign). For a

case-control study, the SMR estimate of bxy using eQTL data from an independent population-based

study (for example, the Westra eQTL data used in this study) is a biased estimate of that if eQTL

data are available in same case-control sample. Therefore, the SMR estimate of bxy for diseases

needs to be interpreted with great caution. For quantitative traits, under the assumption of causality,

bxy can be interpreted as the effect of gene expression on trait free of non-genetic confounders.

Therefore, in the absence of confounders, the estimate of bxy from the SMR analysis should be

consistent with that estimated from a direction regression of trait (y) on gene expression (x). In this

case, the power of detecting bxy in a direct association analysis between x and y is approximately a

function of nRzy2 Rzx

2 for polygenic traits (see Supplementary Note #12 more details), where Rzx2

and Rzy2 are the proportions of variance in trait and gene expression explained by the top cis-QTL,

Nature Genetics: doi:10.1038/ng.3538

Page 16: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

16

respectively. In this scenario (causality), the power to detect an x-y association (TWAS) is higher

than that to detect a z-y association (GWAS) given the same sample size. On the other hand, under

the assumption of pleiotropy, the SMR estimate of bxy does not have an explicit interpretation, and

in the absence of confounders (trait and gene expression are correlated only through a single causal

variant), the power of detecting the direction association between x and y is approximately a

function of nRzy2 Rzx

2 (see Supplementary Note #12 for more details). Under this model (pleiotropy),

the power of detecting an x-y association (TWAS) is smaller than that of detecting a z-y association

(GWAS) given the sample size.

11. Distinguishing causality from pleiotropy

We tested against the hypothesis of causality using the trans-eQTLs in the Westra eQTL data. For

the 104 genes prioritized by the SMR analysis using SNPs in cis-eQTL regions, only 3 of them (2

for height and 1 for BMI) have trans-eQTLs at PeQTL < 5×10–8 (Supplementary Table 15). This is

because the genetic architecture of expression traits are relatively simple (for many gene expression

traits, the majority of heritability is attributed to the cis-eQTL) and because the sample size of the

Westra study is still not large enough to detect the trans-eQTLs whose effect sizes are usually much

smaller than those of cis-eQTLs2. For each of the four genes, there were only two genome-wide

significant trans-eQTLs (PeQTL < 5.0×10–8) that were almost in perfect LD with each other (LD r2 >

0.95). We estimated the bxy at the top trans-eQTL (bxy(top-trans)) using the SMR method, and

compared it with that estimated at the top cis-eQTL (bxy (top-cis)). The difference between bxy (top-cis)

and bxy (top-trans) was tested using the test statistic

TDiff = [b̂xy(top-cis) − b̂xy (top-trans) ]2 / [var(b̂xy (top-cis) )+ var(b̂xy (top-trans) )] , TDiff ~ χ1

2 . There was no covariance

between b̂xy (top-cis) and b̂xy (top-trans) because the trans-eQTLs were all independent from the

corresponding cis-eQTLs (LD r2 < 0.001 in ARIC data). We found that none of them was

significantly different from zero (Supplementary Table 15). Interestingly, the effect size of

FRAP1 on height estimated using the top associated cis-eQTL (bxy (top-cis) = 0.09, s.e. = 0.02) was

Nature Genetics: doi:10.1038/ng.3538

Page 17: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

17

significantly different (PDifference = 6.2×10–5) from that using the top trans-eQTL (bxy (top-trans) = 0.0,

s.e. = 0.01), consistent with a model of pleiotropy rather than causality for this gene.

In addition, for the probes with multiple independent eQTL signals (as implicated by the

GCTA conditional analyses), we tested the difference between the estimates of bxy at the primary

and secondary signals using the test statistic

TDiff = [b̂xy (top) − b̂xy (sec) ]2 / [var(b̂xy (top) )+ var(b̂xy (sec) )− 2cov(b̂xy (top) ,b̂xy (sec) )] where b̂xy (sec) is the estimate

of bxy at the secondary signal using the original eQTL and GWAS summary data, and the variance

and covariance terms can be calculated using the method described above. We found a significant

(correcting for multiple tests) difference for 8 out of 12 genes (Supplementary Table 16),

consistent with the model of pleiotropy. However, the results are not conclusive. The bxy estimates

at all the secondary signals were not significantly different from 0 even at a nominal significance

level (threshold P value = 0.05). The LD correlations between the primary and secondary cis-eQTL

signals were relatively large compared with those reported in previous studies using GWAS data6,7,

suggesting that it is still likely that the two eQTL signals are in LD with a single underlying causal

variant6. If this is the case, we could not reject the linkage model (as in Supplementary Fig. 4 that

if GWAS and eQTL causal variants are in very LD, there is little power to distinguish between

models using only the SNPs in the cis region).

12. Statistical power of detecting a direct association between trait and gene expression under

two assumptions

If we have gene expression (x) and phenotype (y) data available in the same sample, the power, as

measured by the non-centrality (NCP) of the chi-square statistic, of detecting the direct association

between trait and gene expression is NCP = bXY2 / var(b̂XY

2 ) , where bxy is the slope of regressing y

and x based on the model y = xbxy + exy .

Nature Genetics: doi:10.1038/ng.3538

Page 18: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

18

Causality: Under the assumption of causality where the effect size of a genetic variant (z) on

phenotype is mediated by gene expression, bxy2 can be expressed as (assuming both x and y have

been standardized with mean 0 and variance 1)

bxy2 = cov(x, y) var(x)!" #$

2= cov(x,xbxy + exy )!" #$

2= bxy

2 = Rxy2

where Rxy2 is defined as the proportion of variance in y that can be explained by x in the absence of

non-genetic confounders. The sampling variance of b̂xy is

var(b̂xy ) = var(exy ) nvar(x)!" #$= 1−bxy2( ) n = 1− Rxy2( ) n

where n is the sample size. Since bxy = bzy bzx , i.e. Rxy2 = Rzy

2 Rzx2 , we have

NCPcausality = nRxy2 1− Rxy

2( ) = nRzy2 Rzx2 − Rzy

2( ) . For polygenic traits where Rzy2 (proportion of

phenotypic variance explained by a single genetic variant) is small relative to Rzx2 (proportion of

variance in gene expression explained by a genetic variant), we then have NCPcausality ≈ nRzy2 Rzx

2

Pleiotropy: Under the assumption of pleiotropy, a genetic variant has direct effects on both gene

expression and trait phenotype, i.e. x = zbzx + ezx and y = zbzy + ezy , and assuming x and y are

correlated only through z, we have bxy2 = cov(x, y) var(x)!" #$

2= cov(zbzx + ezx , zbzy + ezy )!" #$

2= Rzx

2 Rzy2

The sampling variance is var(b̂xy ) = 1− Rzx2 Rzy

2( ) n . We then get

NCPpleiotropy = nRzx2 Rzy

2 1− Rzx2 Rzy

2( ) . Again, for polygenic traits where the phenotypic variance

explained by an individual variant is small, we have NCPpleiotropy ≈ nRzx2 Rzy

2 .

13. Acknowledgements

The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by

National Heart, Lung, and Blood Institute contracts (HHSN268201100005C,

Nature Genetics: doi:10.1038/ng.3538

Page 19: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

19

HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C,

HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641,

R01HL59367 and R01HL086694; National HumanGenome Research Institute contract

U01HG004402; and National Institutes of Health contract HHSN268200625226C. The authors

thank the staff and participants of the ARIC study for their important contributions. Infrastructure

was partly supported by Grant Number UL1RR025005, a component of the National Institutes of

Health and NIH Roadmap for Medical Research.

14. References

1. Schizophrenia Working Group of the Psychiatric Genomics, C. Biological insights from 108

schizophrenia-associated genetic loci. Nature 511, 421-7 (2014).

2. Westra, H.J. et al. Systematic identification of trans eQTLs as putative drivers of known

disease associations. Nature Genetics 45, 1238-U195 (2013).

3. Ramasamy, A. et al. Genetic variability in the regulation of gene expression in ten regions

of the human brain. Nat Neurosci 17, 1418-28 (2014).

4. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery.

Nature 506, 376-81 (2014).

5. Powell, J.E. et al. The Brisbane Systems Genetics Study: Genetical Genomics Meets

Complex Trait Genetics. Plos One 7(2012).

6. Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics

identifies additional variants influencing complex traits. Nat. Genet. 44, 369-75 (2012).

7. Wood, A.R. et al. Defining the role of common variation in the genomic and biological

architecture of adult human height. Nat Genet 46, 1173-86 (2014).

8. Lynch, M. & Walsh, B. Genetics and analysis of quantitative traits, (Sunderland, MA:

Sinauer Associates, 1998).

9. Schadt, E.E. Novel integrative genomics strategies to identify genes for complex traits.

Anim Genet 37 Suppl 1, 18-23 (2006).

10. Schadt, E.E., Monks, S.A. & Friend, S.H. A new paradigm for drug discovery: integrating

clinical, genetic, genomic and molecular phenotype data to identify drug targets. Biochem

Soc Trans 31, 437-43 (2003).

Nature Genetics: doi:10.1038/ng.3538

Page 20: Nature Genetics: doi:10.1038/ng · by a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signal was defined as the top associated

20

11. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to

enhance discovery from GWAS. PLoS Genet 6, e1000888 (2010).

12. Plagnol, V., Smyth, D.J., Todd, J.A. & Clayton, D.G. Statistical independence of the

colocalized association signals for type 1 diabetes and RPS26 gene expression on

chromosome 12q13. Biostatistics 10, 327-34 (2009).

13. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic

association studies using summary statistics. PLoS Genet 10, e1004383 (2014).

14. Guo, H. et al. Integration of disease association and eQTL data using a Bayesian

colocalisation approach highlights six candidate causal genes in immune-mediated diseases.

Hum Mol Genet 24, 3305-13 (2015).

15. Gamazon, E.R. et al. A gene-based association method for mapping traits using reference

transcriptome data. Nat Genet 47, 1091-8 (2015).

16. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association

studies. Nat Genet 48, 245-52 (2016).

Nature Genetics: doi:10.1038/ng.3538