nature genetics: doi:10.1038/ng · by a gcta-cojo analysis6, we selected the top gwas snp,...
TRANSCRIPT
Supplementary Figure 1
QQ plots of P values from the SMR tests under a range of simulation scenarios.
The simulation strategy is described in the Supplementary Note. Shown are the results from 10,000 simulations. MR one sample,Mendelian randomization analysis in one sample; SMR one sample, summary data–based Mendelian randomization analysis in one sample; SMR two samples, the effect sizes of a SNP on gene expression (βzx) and a SNP on phenotype (bzy) are estimated from two independent samples.
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 2
Prioritizing genes at the NMRAL1 locus for schizophrenia.
(a) Top, P values from the GWAS for schizophrenia1 (brown dots) and P values from SMR tests (diamonds) using Westra eQTL data2.Bottom, P values from the Westra and Ramasamy eQTL studies3 for NMRALI1. Shown in a are all the SNPs available in the GWAS and eQTL data. (b,c) Effect sizes of the SNPs (used for the HEIDI test) from the GWAS against those from the Westra and Ramasamy eQTL studies. The orange dashed lines represent the estimate of bxy at the top cis-eQTL (rather than the regression line). Error bars are the standard errors of SNP effects.
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 3
Genes with multiple tagging probes that pass the SMR and HEIDI tests—the ANKRD55 locus for rheumatoid arthritis.
(a) Top, dots in brown represent P values for SNPs from the latest GWAS for rheumatoid arthritis4, diamonds represent P values for probes from the SMR analysis and triangles represent probes without a cis-eQTL at PeQTL < 5.0 × 10–8. Probes that passed the HEIDI test (PHEIDI ≥ 0.05) are highlighted in red. Bottom, P values from eQTL studies for the two probes that both tag ANKRD55. Shown in aare all SNPs available in the GWAS and eQTL summary data (rather than those in common). (b,c) Effect sizes of the SNPs (used inthe HEIDI test) from the GWAS against those from the eQTL study for the two probes. The orange dashed lines represent the estimate of bxy at the top cis-eQTL. Error bars are the standard errors of SNP effects. (d) Plot of expression levels of the two probes against each other in the data (328 unrelated individuals) from the Brisbane Systems Genetics Study5.
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 4
Power of detecting heterogeneity in bxy as a function of LD between two causal variants in simulations.
We simulated data under the alternative hypothesis that there are two distinct causal variants, one affecting gene expression and one affecting the trait. The method for simulation is described in the Supplementary Note.
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 5
Distribution of LD between the top associated GWAS SNPs and the simulated causal eQTLs.
The results are from an SMR analysis of real GWAS data for five complex traits (Supplementary Table 6) and simulated eQTL data. The eQTL data were simulated on the basis of real genotypes, mimicking the Westra study but with the causal variants of cis-eQTLs randomly placed across the genome (see the Supplementary Note for details). There were a number of simulated probes that passedthe SMR and HEIDI tests (Supplementary Table 14). For these probes, the overlaps between the GWAS and eQTL signals are due to the strong LD between the GWAS causal variants and the causal eQTLs. Unfortunately, the GWAS causal variants are unknown.Shown are LD r2 values for the top associated GWAS SNPs and the simulated causal eQTLs in these probe regions, which are likely tobe underestimations of LD between the causal variants. For regions where there were multiple independent GWAS signals as indicatedby a GCTA-COJO analysis6, we selected the top GWAS SNP, conditioning the secondary GWAS signal (a secondary GWAS signalwas defined as the top associated GWAS SNP in the region when conditioning on the GWAS SNP used in the SMR test).
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 6
Estimating SNP effects from z statistics.
The reported GWAS effect sizes are equivalent to the SNP effects reported in the GWAS summary data. The estimated GWAS effect sizes are equivalent to the SNP effects estimated from z statistics using the sample size of the whole study rather than the per-SNP sample size reported in the summary data (Supplementary Note). This is just a demonstration of the difference between the reported estimate of effect size and the effect size estimated from z statistics without taking the per-SNP sample size into account. In our SMR analyses for height, we used the reported estimate of effect size.
Nature Genetics: doi:10.1038/ng.3538
Supplementary Figure 7
An example of pleiotropic signal being diluted by multiple associated SNPs in GWAS.
(a) Effect sizes of the SNPs (used in the HEIDI test) from the latest GWAS for height7 against those from the Westra eQTL study2 at the SPI1 locus. We detected a secondary signal in the GWAS data. (b) Effect sizes conditioning on the secondary signal. The orangedashed lines represent the estimate of bxy at the top cis-eQTL. Error bars are the standard errors of SNP effects.
Nature Genetics: doi:10.1038/ng.3538
8
Supplementary Note
1. Simulation strategy
We conducted a series of simulations to compare SMR with MR under two assumptions, causality
and pleiotropy.
Direct causality model: We simulated the genotype of a SNP (z) from a binomial distribution, w ~
B(2, p), with p being the allele frequency of the SNP, p ~ U(0.05, 0.5). The SNP genotypes were
then standardized as z = (w− 2 p) / 2 p(1− p) . We simulated the gene expression based on the
model x = zbzx + c+ ezx , where c is a latent confounding variable with c ~ N(0,σ c2 ) ;
σ c2 = var(zbzx )Rc
2 Rzx2 ; Rc
2 is the proportion of variance in x explained by c, Rc2 ~U (0,0.5) ; Rzx
2 is
the proportion of variance in x explained by z, Rzx2 = 0.05, 0.1 or 0.2;
ezx ~ N (0,var(wbzx )(1 / Rzx2 −1)−σ c
2 ) . We then standardized x with mean 0 and variance 1. We
simulated the trait based on the model y = xbxy + c+ exy , where exy ~ N[0,var(Xbxy )(1 / Rxy2 −1)−σ c
2 ] ;
Rxy2 is proportion of variance in y explained by x, Rxy
2 = 0.0, 0.01 or 0.05. We also standardized y
with mean 0 and variance 1. It is notable that the magnitude of bzx or bxy is not important because
they are scaled according to a fixed proportion of variance explained. We simulated 1,000
individuals for dataset 1, and simulated another 10,000 individuals for dataset 2. We repeated the
simulation for 10,000 times.
Direct pleiotropy model: We simulated the data using the same setting as above but generating the
trait phenotype based on the model y = zbzy + c+ ezy where ezy ~ N 0,var(zbzy ) / (1 / Rzy2 −1)−σ c
2"#
$%
and Rzy2 is proportion of variance in y explained by z with rzy
2 = 0.0, 0.005 or 0.01.
Indirect causality model: We simulated the genotype of a SNP using the same method as described
above. The latent variable c was simulated based on the model, c = zbzc + ezc where ezc is generated
from a normal distribution ezc ~ N (0,σ zc2 ) , σ zc
2 = var(zbzc )(1 / Rzc2 −1) , and Rzc
2 is the proportion of
Nature Genetics: doi:10.1038/ng.3538
9
variance in c explained by z ( Rzc2 = 0.6 ). We standardized c and generated x based on the model
x = cbcx + ecx , where ecx ~ N (0,var(cbcx )(1 / Rcx2 −1)) and Rcx
2 is the proportion of variance in x
explained by c ( Rcx2 = 0.05, 0.1 and 0.2). We then standardized z and generated y based on the model
y = xbxy + exy where exy ~ N (0,var(xbxy )(1 / Rxy2 −1)) and Rxy
2 =0, 0.01 and 0.05. We standardized y
with mean 0 and variance 1. We simulated 1,000 individuals for dataset 1, and 10,000 individuals
for dataset 2. We repeated the simulation for 10,000 times.
Indirect pleiotropy model: We simulated z and x using the same method as in the indirect causality
model but generating the trait based on the model y = cbcy + ecy , where
ecy ~ N (0,var(cbcy )(1 / Rcy2 −1)) and Rcy
2 is the proportion of variance in x explained by c ( Rcy2 = 0,
0.005 and 0.01).
In each simulation replicate, we estimated bzx and bzy by a linear regression analysis, and
performed the MR and SMR analyses following the methods described in the main text (Online
Methods). To investigate whether bxy estimated from the SMR method is consistent with the effect
on x of y due to genetic factor, we estimated b(y~gx) by a linear regression analysis of y on the
genetic value of x (gx).
2. Estimating SNP effect from z-statistic
In an association analysis, if we assume the phenotype (y) has been standardized to mean 0 and
variance 1, the standard error (SE) of the estimate of SNP effect can be calculated as
SE = σ e2
nvar(w)=1− 2p(1− p)b2
2p(1− p)n
where σ e2 is the residual variance of regressing y on w, w is the genotype variable of a SNP (coded
as 0, 1 or 2), b is the SNP effect on trait, p is the minor allele frequency (MAF) of the SNP, and n is
Nature Genetics: doi:10.1038/ng.3538
10
the sample size. We know that the z-statistic can be calculated as z = b̂ / SE . We therefore have
b̂ = z / 2p(1− p)(n+ z2 ) and SE =1 / 2 p(1− p)(n+ z2 ) .
3. Sampling covariance between b̂xy estimated at two SNPs in LD
The sampling covariance between the bxy estimates at two correlated SNPs (SNP i and j) can be
written as
cov(b̂xy (i ) ,b̂xy ( j ) ) = E(b̂xy (i )b̂xy ( j ) )− E(b̂xy (i ) )E(b̂xy ( j ) )
= E(b̂zy (i )b̂zy ( j )β̂zx(i )β̂zx( j )
)− E(b̂zy (i )β̂zx(i )
)E(b̂zy ( j )β̂zx( j )
)
Using the Delta method8, we get
E(b̂zy (i )b̂zy ( j )β̂zx(i )β̂zx( j )
) ≈bzy (i )bzy ( j )βzx(i )βzx( j )
+bzy (i )bzy ( j )β 3zx(i )βzx( j )
var(β̂zx(i ) )+bzy (i )bzy ( j )βzx(i )β
3zx( j )
var(β̂zx( j ) )
+cov(b̂zy (i ) ,b̂zy ( j ) )
βzx(i )βzx( j )+bzy (i )bzy ( j )β 2zx(i )β
2zx( j )
cov(β̂zx(i ) , β̂zx( j ) )
E(b̂zyβ̂zx) =bzyβzx
1+ var(β̂zx )β 2zx
−cov(b̂zy , β̂zx )
bzyβzx
"
#$$
%
&''=bzyβzx
1+ 1χ zx2−cov(b̂zy , β̂zx )bzyβzx
"
#$$
%
&'' , where χ zx
2 is the chi-
squared statistic to test for the effect of z on x. Under the null, cov(b̂zy , β̂zx ) = 0 in particular if they
are estimated from two independent studies. Since z is selected to be strongly associated with x,
1+ 1χ zx2≈1 . We get E(
b̂zyβ̂zx) ≈bzyβzx
and E(b̂zy (i )β̂zx(i )
)E(b̂zy ( j )β̂zx( j )
) ≈bzy (i )bzy ( j )βzx(i )βzx( j )
.
We therefore have
cov(b̂xy (i ) ,b̂xy ( j ) ) ≈cov(b̂zy (i ) ,b̂ zy ( j ) )
βzx(i )βzx( j )+bzy (i )b zy ( j )β 2zx(i )β
2zx( j )
cov(β̂zx(i ) , β̂zx( j ) )
−bzy (i )b zy ( j )β 3zx(i )β
3zx( j )
var(β̂zx(i ) ) var(β̂zx( j ) )
Nature Genetics: doi:10.1038/ng.3538
11
We know that the sampling correlation between the estimates of SNP effects equals to the LD
correlation between the SNPs. We then get
cov(b̂xy (i ) ,b̂xy ( j ) ) ≈r
βzx(i )βzx( j )var(b̂zy (i ) ) var(b̂ zy ( j ) ) +bxy (i )bxy ( j )
rzzx(i )zzx( j )
−1
z2zx(i )z2zx( j )
#
$%%
&
'((
where zzx = β̂zx / var(β̂zx ) .
4. Test of heterogeneity in bxy estimated at two linked SNPs for two probes
Following the derivation above (Supplementary Note #3), the sampling covariance between the
estimates of bxy for two probes (x1 and x2) can be expressed as
cov(b̂xy (1i ) ,b̂xy (2 j ) ) ≈cov(b̂zy (1i ) ,b̂ zy (2 j ) )
βzx(1i )βzx(2 j )+bzy (1i )b zy (2 j )β 2zx(1i )β
2zx(2 j )
cov(β̂zx(1i ) , β̂zx(2 j ) )
−bzy (1i )b zy (2 j )β 3zx(1i )β
3zx(2 j )
var(β̂zx(1i ) ) var(β̂zx(2 j ) )
where the subscripts i and j represent two SNPs, and cov(β̂zx(1i ) , β̂zx(2 j ) ) is the sampling covariance
between the estimates of effect sizes of two eQTLs for two probes (see Supplementary Note #3 for
the definitions of the other parameters).
The least squares (LS) estimates of two eQTL effects can be expressed as
β̂zx(1i ) = !zix1 !zizi( ) and β̂zx(2 j ) = !z jx2 !z jz j( ) , with sampling variance var(β̂zx(1i ) ) = var(x1) !zizi( )
and var(β̂zx(2 j ) ) = var(x2 ) !z jz j( ) , where z is a vector of standardized SNP genotypes and x is a
vector of standardized gene expression levels. We therefore have the sampling covariance between
two LS estimates as
cov(β̂zx(1i ) , β̂zx(2 j ) ) = cov !zix1, !z jx2( ) !zizi !z jz j( ) = !ziz j cov x1,x2( ) !zizi !z jz j( )
= rijrx var(β̂zx(1i ) ) var(β̂zx(2 j ) )
where rij is the LD correlation between SNPs, and rx is the phenotypic correlation between probes.
Nature Genetics: doi:10.1038/ng.3538
12
We therefore can have a heterogeneity test-statistic as (β̂zx(1i ) − β̂zx(2 j ) )2 / var(β̂zx(1i ) − β̂zx(2 j ) )
which follows χ12 under the null hypothesis that there is no difference, where
var(β̂zx(1i ) − β̂zx(2 j ) ) = var(β̂zx(1i ) )+ var(β̂zx(2 j ) )− 2cov(β̂zx(1i ) , β̂zx(2 j ) ) .
5. Mean distance of probes and cis-eQTLs to the transcription end sites
We identified 9 genes for which there was a significant difference in bxy between the tagging probes
(Supplementary Table 10). We then tested whether the secondary probes of these genes (9 probes
that did not passed the SMR or HEIDI test) tend to be located near the transcription end sites (TES).
We calculated mean distance of the probes (as well as their top associated cis-eQTLs) to TES,
which was then compared with what one would expect if probes (or cis-eQTLs) were sampled at
random. The probes were sampled from all genes with matched gene lengths (genes were classified
into bins by length with 2Kb intervals), and the cis-eQTLs were sampled from all the cis-eQTLs
with matched MAFs (cis-eQTLs were binned by MAF with 10 intervals). The whole process was
repeated 10,000 times.
6. A summary description of the methods for detecting gene-trait association using GWAS
and eQTL data
There are other methods for detecting gene-trait associations using GWAS and gene expression
data. For example, prior to the GWAS era, Schadt and colleagues9,10 conceptualized an integrative
analysis of genetic and eQTL data for fine mapping and identification of drug targets. Nicolae et
al.11 showed that trait-associated variants identified from GWAS were enriched for eQTLs and
proposed to use this information to identify novel genetic risk factors for diseases. COLOC12-14 is a
Bayesian analysis approach that uses GWAS and eQTL P values to assess whether two association
signals (one from GWAS and one from eQTL study) are consistent with a single shared causal
variant. This is a very useful tool to detect colocalization of GWAS and eQTL signals at known
Nature Genetics: doi:10.1038/ng.3538
13
GWAS loci, and to distinguish between models (for example, one causal variant vs. two causal
variants). COLOC does not provide thresholds for the posterior probabilities to control for genome-
wide false positive rate (there were multiple thresholds used in Giambartolomei, et al.13), which
makes it challenging to apply the method for genome-wide scan. PrediXcan15 links GWAS and
eQTL data by a polygenic prediction analysis. It requires individual-level genotype and gene
expression data in the training set, and individual-level genotype and phenotype data in the target
set, which limits the power because of the limited sample sizes of such data sets at present. Gusev et
al.16 proposed an elegant method to overcome this problem by performing a polygenic prediction
analysis using summary-level data. Both PrediXcan and the Gusev et al. method do not distinguish
between pleiotropy and linkage.
7. Quantifying the power of detecting heterogeneity in bxy as a function of LD between two
causal variants by simulations
We used simulations to calibrate the extent to which the power of detecting heterogeneity in bXY
decreases with LD between two causal variants. We performed simulations based on the HM2-
imputed ARIC data set. We used 182,567 SNPs on chromosome 1 with minor allele frequency
(MAF) ≥ 0.01 and Hardy-Weinberg Equilibrium (HWE) P value ≥ 1×10–6. In each simulation, we
randomly generated a physical position, and included the SNPs within ±250Kb of the position as
“cis-eQTLs”. We then randomly sampled from the cis-eQTLs two SNPs as causal variants (z1 and
z2), one for gene expression and one for trait. We generated gene expression value based on the
model x = z1bzx + c+ ezx , and trait value based on the model y = z2bzy + c+ ezy , where Rzx2
= 0.2 and
Rzy2 = 0.01 (see above Supplementary Note #1 for the definitions of the parameters and variables).
We sampled the two causal variants with LD from low to high in ten categories, and repeated the
simulation 1,000 times. We analyzed the simulated data (excluding the causal variants) using the
SMR method with the same criteria as used in real data analysis, for example, including only the
Nature Genetics: doi:10.1038/ng.3538
14
genes with a cis-eQTL at PeQTL < 5×10–8 for SMR test, and performing HEIDI test only for genes
with PSMR < 8.4×10–6. The simulation results are shown in Supplementary Fig. 4.
8. SMR analysis with simulated eQTL data
We performed an analysis to quantify the enrichment of the observed number of probes that passed
the SMR and HEID tests compared with what one would expect if the eQTLs were randomly
placed. We could not simply shuffle the positions of either the GWAS SNPs or the eQTLs because
of LD. We therefore used a simulation-based approach. We simulated eQTL data to mimic the
Westra eQTL study using real genotype data (6,654 individuals and 2,413,012 SNPs with MAF ≥
0.01 from the HapMap 2-imputed ARIC study). We randomly generated a physical position apart
from MHC region as the midpoint of a probe, and defined a ±250Kb window centered on the
position as a cis-eQTL region. We then randomly sampled a SNP from the cis-eQTL region as
causal variant, and generated gene expression level (x) by a simple additive model x = zbzx + c+ ezx
(see Supplementary Note #1 for the definitions of the parameters and variables). To achieve
similar power as in the Westra study, we assigned Rzx2 = χ 2 −1( ) χ 2 −1+ n( ) , where n is the sample
size (n = 6,554), and χ 2 is the chi-squared statistic of the top cis-eQTL for a probe in the Westra
data (the χ 2 values of the top cis-eQTLs included in the SMR analysis were > 29.7, equivalent to
p-values < 5×10–8). We estimated the eQTL effect sizes and standard errors by a linear regression
analysis of the simulated gene expression level on SNPs. We repeated this process and generated
eQTL data for 5,967 probes as used in real data analysis. We then preformed the SMR analyses
using the simulated eQTL data and real-life GWAS summary data for height, BMI, WHRadjBMI,
RA and SCZ (excluding the MHC region as in real data analysis). The whole simulation process
was repeated 100 times.
9. Four reasons why the HEIDI test is over-conservative
Nature Genetics: doi:10.1038/ng.3538
15
There are a least four reasons for the HEIDI test being too conservative. First, we used a
significance level of 0.05 without correcting for multiple testing. This means that under the null
hypothesis of no heterogeneity (that is, all the associations between trait and gene expression
detected by the initial SMR test are due to pleiotropy), we will eliminate 5% of the real findings.
Secondly, we used summary data in the public domain that inevitably contain some errors, for
example, the effect alleles may be mislabeled, which will be detected as heterogeneity. Thirdly, we
estimated eQTL effect sizes from z-statistics without taking into account the difference in per-SNP
sample size (per-SNP sample size was not available in the Westra eQTL data), which generated
additional variation in the estimates of eQTL effect sizes. Shown in Supplementary Fig. 6 is a
demonstration of the difference between the reported effect size, and the effect size estimated from
z-statistic without taking per-SNP sample size into account using the latest GIANT summary data
for height. Last but not least, multiple association signals at single loci in either GWAS or eQTL
studies will be detected as heterogeneity (see Supplementary Fig. 7 for an example).
10. The interpretations of bxy under different assumptions
The estimate of bxy from the SMR analysis has a direction (that is, positive or negative sign). For a
case-control study, the SMR estimate of bxy using eQTL data from an independent population-based
study (for example, the Westra eQTL data used in this study) is a biased estimate of that if eQTL
data are available in same case-control sample. Therefore, the SMR estimate of bxy for diseases
needs to be interpreted with great caution. For quantitative traits, under the assumption of causality,
bxy can be interpreted as the effect of gene expression on trait free of non-genetic confounders.
Therefore, in the absence of confounders, the estimate of bxy from the SMR analysis should be
consistent with that estimated from a direction regression of trait (y) on gene expression (x). In this
case, the power of detecting bxy in a direct association analysis between x and y is approximately a
function of nRzy2 Rzx
2 for polygenic traits (see Supplementary Note #12 more details), where Rzx2
and Rzy2 are the proportions of variance in trait and gene expression explained by the top cis-QTL,
Nature Genetics: doi:10.1038/ng.3538
16
respectively. In this scenario (causality), the power to detect an x-y association (TWAS) is higher
than that to detect a z-y association (GWAS) given the same sample size. On the other hand, under
the assumption of pleiotropy, the SMR estimate of bxy does not have an explicit interpretation, and
in the absence of confounders (trait and gene expression are correlated only through a single causal
variant), the power of detecting the direction association between x and y is approximately a
function of nRzy2 Rzx
2 (see Supplementary Note #12 for more details). Under this model (pleiotropy),
the power of detecting an x-y association (TWAS) is smaller than that of detecting a z-y association
(GWAS) given the sample size.
11. Distinguishing causality from pleiotropy
We tested against the hypothesis of causality using the trans-eQTLs in the Westra eQTL data. For
the 104 genes prioritized by the SMR analysis using SNPs in cis-eQTL regions, only 3 of them (2
for height and 1 for BMI) have trans-eQTLs at PeQTL < 5×10–8 (Supplementary Table 15). This is
because the genetic architecture of expression traits are relatively simple (for many gene expression
traits, the majority of heritability is attributed to the cis-eQTL) and because the sample size of the
Westra study is still not large enough to detect the trans-eQTLs whose effect sizes are usually much
smaller than those of cis-eQTLs2. For each of the four genes, there were only two genome-wide
significant trans-eQTLs (PeQTL < 5.0×10–8) that were almost in perfect LD with each other (LD r2 >
0.95). We estimated the bxy at the top trans-eQTL (bxy(top-trans)) using the SMR method, and
compared it with that estimated at the top cis-eQTL (bxy (top-cis)). The difference between bxy (top-cis)
and bxy (top-trans) was tested using the test statistic
TDiff = [b̂xy(top-cis) − b̂xy (top-trans) ]2 / [var(b̂xy (top-cis) )+ var(b̂xy (top-trans) )] , TDiff ~ χ1
2 . There was no covariance
between b̂xy (top-cis) and b̂xy (top-trans) because the trans-eQTLs were all independent from the
corresponding cis-eQTLs (LD r2 < 0.001 in ARIC data). We found that none of them was
significantly different from zero (Supplementary Table 15). Interestingly, the effect size of
FRAP1 on height estimated using the top associated cis-eQTL (bxy (top-cis) = 0.09, s.e. = 0.02) was
Nature Genetics: doi:10.1038/ng.3538
17
significantly different (PDifference = 6.2×10–5) from that using the top trans-eQTL (bxy (top-trans) = 0.0,
s.e. = 0.01), consistent with a model of pleiotropy rather than causality for this gene.
In addition, for the probes with multiple independent eQTL signals (as implicated by the
GCTA conditional analyses), we tested the difference between the estimates of bxy at the primary
and secondary signals using the test statistic
TDiff = [b̂xy (top) − b̂xy (sec) ]2 / [var(b̂xy (top) )+ var(b̂xy (sec) )− 2cov(b̂xy (top) ,b̂xy (sec) )] where b̂xy (sec) is the estimate
of bxy at the secondary signal using the original eQTL and GWAS summary data, and the variance
and covariance terms can be calculated using the method described above. We found a significant
(correcting for multiple tests) difference for 8 out of 12 genes (Supplementary Table 16),
consistent with the model of pleiotropy. However, the results are not conclusive. The bxy estimates
at all the secondary signals were not significantly different from 0 even at a nominal significance
level (threshold P value = 0.05). The LD correlations between the primary and secondary cis-eQTL
signals were relatively large compared with those reported in previous studies using GWAS data6,7,
suggesting that it is still likely that the two eQTL signals are in LD with a single underlying causal
variant6. If this is the case, we could not reject the linkage model (as in Supplementary Fig. 4 that
if GWAS and eQTL causal variants are in very LD, there is little power to distinguish between
models using only the SNPs in the cis region).
12. Statistical power of detecting a direct association between trait and gene expression under
two assumptions
If we have gene expression (x) and phenotype (y) data available in the same sample, the power, as
measured by the non-centrality (NCP) of the chi-square statistic, of detecting the direct association
between trait and gene expression is NCP = bXY2 / var(b̂XY
2 ) , where bxy is the slope of regressing y
and x based on the model y = xbxy + exy .
Nature Genetics: doi:10.1038/ng.3538
18
Causality: Under the assumption of causality where the effect size of a genetic variant (z) on
phenotype is mediated by gene expression, bxy2 can be expressed as (assuming both x and y have
been standardized with mean 0 and variance 1)
bxy2 = cov(x, y) var(x)!" #$
2= cov(x,xbxy + exy )!" #$
2= bxy
2 = Rxy2
where Rxy2 is defined as the proportion of variance in y that can be explained by x in the absence of
non-genetic confounders. The sampling variance of b̂xy is
var(b̂xy ) = var(exy ) nvar(x)!" #$= 1−bxy2( ) n = 1− Rxy2( ) n
where n is the sample size. Since bxy = bzy bzx , i.e. Rxy2 = Rzy
2 Rzx2 , we have
NCPcausality = nRxy2 1− Rxy
2( ) = nRzy2 Rzx2 − Rzy
2( ) . For polygenic traits where Rzy2 (proportion of
phenotypic variance explained by a single genetic variant) is small relative to Rzx2 (proportion of
variance in gene expression explained by a genetic variant), we then have NCPcausality ≈ nRzy2 Rzx
2
Pleiotropy: Under the assumption of pleiotropy, a genetic variant has direct effects on both gene
expression and trait phenotype, i.e. x = zbzx + ezx and y = zbzy + ezy , and assuming x and y are
correlated only through z, we have bxy2 = cov(x, y) var(x)!" #$
2= cov(zbzx + ezx , zbzy + ezy )!" #$
2= Rzx
2 Rzy2
The sampling variance is var(b̂xy ) = 1− Rzx2 Rzy
2( ) n . We then get
NCPpleiotropy = nRzx2 Rzy
2 1− Rzx2 Rzy
2( ) . Again, for polygenic traits where the phenotypic variance
explained by an individual variant is small, we have NCPpleiotropy ≈ nRzx2 Rzy
2 .
13. Acknowledgements
The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by
National Heart, Lung, and Blood Institute contracts (HHSN268201100005C,
Nature Genetics: doi:10.1038/ng.3538
19
HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C,
HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641,
R01HL59367 and R01HL086694; National HumanGenome Research Institute contract
U01HG004402; and National Institutes of Health contract HHSN268200625226C. The authors
thank the staff and participants of the ARIC study for their important contributions. Infrastructure
was partly supported by Grant Number UL1RR025005, a component of the National Institutes of
Health and NIH Roadmap for Medical Research.
14. References
1. Schizophrenia Working Group of the Psychiatric Genomics, C. Biological insights from 108
schizophrenia-associated genetic loci. Nature 511, 421-7 (2014).
2. Westra, H.J. et al. Systematic identification of trans eQTLs as putative drivers of known
disease associations. Nature Genetics 45, 1238-U195 (2013).
3. Ramasamy, A. et al. Genetic variability in the regulation of gene expression in ten regions
of the human brain. Nat Neurosci 17, 1418-28 (2014).
4. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery.
Nature 506, 376-81 (2014).
5. Powell, J.E. et al. The Brisbane Systems Genetics Study: Genetical Genomics Meets
Complex Trait Genetics. Plos One 7(2012).
6. Yang, J. et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics
identifies additional variants influencing complex traits. Nat. Genet. 44, 369-75 (2012).
7. Wood, A.R. et al. Defining the role of common variation in the genomic and biological
architecture of adult human height. Nat Genet 46, 1173-86 (2014).
8. Lynch, M. & Walsh, B. Genetics and analysis of quantitative traits, (Sunderland, MA:
Sinauer Associates, 1998).
9. Schadt, E.E. Novel integrative genomics strategies to identify genes for complex traits.
Anim Genet 37 Suppl 1, 18-23 (2006).
10. Schadt, E.E., Monks, S.A. & Friend, S.H. A new paradigm for drug discovery: integrating
clinical, genetic, genomic and molecular phenotype data to identify drug targets. Biochem
Soc Trans 31, 437-43 (2003).
Nature Genetics: doi:10.1038/ng.3538
20
11. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to
enhance discovery from GWAS. PLoS Genet 6, e1000888 (2010).
12. Plagnol, V., Smyth, D.J., Todd, J.A. & Clayton, D.G. Statistical independence of the
colocalized association signals for type 1 diabetes and RPS26 gene expression on
chromosome 12q13. Biostatistics 10, 327-34 (2009).
13. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic
association studies using summary statistics. PLoS Genet 10, e1004383 (2014).
14. Guo, H. et al. Integration of disease association and eQTL data using a Bayesian
colocalisation approach highlights six candidate causal genes in immune-mediated diseases.
Hum Mol Genet 24, 3305-13 (2015).
15. Gamazon, E.R. et al. A gene-based association method for mapping traits using reference
transcriptome data. Nat Genet 47, 1091-8 (2015).
16. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association
studies. Nat Genet 48, 245-52 (2016).
Nature Genetics: doi:10.1038/ng.3538