pathway-driven gene stability selection of two rheumatoid arthritis

13
Pathway-driven gene stability selection of two rheumatoid arthritis GWAS identifies and validates new susceptibility genes in receptor mediated signalling pathways Hariklia Eleftherohorinou 1, , Clive J. Hoggart 1 , Victoria J. Wright 2 , Michael Levin 2, { and Lachlan J.M. Coin 1, { 1 Department of Epidemiology and Biostatistics and 2 Department of Paediatrics, Division of Medicine, Imperial College London, Norfolk Place, London W2 1PG, UK Received January 11, 2011; Revised May 17, 2011; Accepted May 31, 2011 Rheumatoid arthritis (RA) is the commonest chronic, systemic, inflammatory disorder affecting 1% of the world population. It has a strong genetic component and a growing number of associated genes have been discovered in genome-wide association studies (GWAS), which nevertheless only account for 23% of the total genetic risk. We aimed to identify additional susceptibility loci through the analysis of GWAS in the con- text of biological function. We bridge the gap between pathway and gene-oriented analyses of GWAS, by introducing a pathway-driven gene stability-selection methodology that identifies potential causal genes in the top-associated disease pathways that may be driving the pathway association signals. We analysed the WTCCC and the NARAC studies of 5000 and 2000 subjects, respectively. We examined 700 pathways comprising 8000 genes. Ranking pathways by significance revealed that the NARAC top-ranked 6% laid within the top 10% of WTCCC. Gene selection on those pathways identified 58 genes in WTCCC and 61 in NARAC; 21 of those were common (P overlap < 10 221 ), of which 16 were novel discoveries. Among the identified genes, we validated 10 known RA associations in WTCCC and 13 in NARAC, not discovered using single-SNP approaches on the same data. Gene ontology functional enrichment analysis on the identified genes showed significant over-representation of signalling activity (P < 10 229 ) in both studies. Our findings suggest a novel model of RA genetic predisposition, which involves cell-membrane receptors and genes in second messen- ger signalling systems, in addition to genes that regulate immune responses, which have been the focus of interest previously. INTRODUCTION Rheumatoid arthritis (RA) is the commonest inflammatory joint disease affecting over 1% of Caucasian populations. The disorder results in chronic inflammation of multiple per- ipheral joints leading to destruction of cartilage and bones, progressive deformity and severe disability. The aetiology and pathogenesis of RA remains unknown (1,2), but immunological factors clearly play a major role which is supported by the presence of inflammatory cells in the synovium, the association of the disease with auto- antibodies, the cellular infiltration in inflamed tissues and the response of patients to immunomodulatory therapy (3). However, a number of clinical features suggest involvement of non-immune mechanisms: the pro- gressive increase of RA incidence over the age of 50, the presence of subcutaneous nodules of proliferating cells at pressure points, the association of disease with osteopenia and proliferative and destructive lesions in cartilage, bones and other organs (1,2). To whom correspondence should be addressed at: Centre for Epidemiology and Biostatistics, Imperial College London, St Mary’s campus, Norfolk Place, London W2 1PG, UK. Tel: +44 2075941942; Fax: +44 2075943984; Email: [email protected] Both authors contributed equally to the work. # The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] Human Molecular Genetics, 2011, Vol. 20, No. 17 3494–3506 doi:10.1093/hmg/ddr248 Advance Access published on June 8, 2011 Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070 by guest on 27 March 2018

Upload: vohuong

Post on 23-Jan-2017

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pathway-driven gene stability selection of two rheumatoid arthritis

Pathway-driven gene stability selectionof two rheumatoid arthritis GWAS identifiesand validates new susceptibility genes in receptormediated signalling pathways

Hariklia Eleftherohorinou1,∗, Clive J. Hoggart1, Victoria J. Wright2, Michael Levin2,{

and Lachlan J.M. Coin1,{

1Department of Epidemiology and Biostatistics and 2Department of Paediatrics, Division of Medicine, Imperial College

London, Norfolk Place, London W2 1PG, UK

Received January 11, 2011; Revised May 17, 2011; Accepted May 31, 2011

Rheumatoid arthritis (RA) is the commonest chronic, systemic, inflammatory disorder affecting ∼1% of theworld population. It has a strong genetic component and a growing number of associated genes have beendiscovered in genome-wide association studies (GWAS), which nevertheless only account for 23% of thetotal genetic risk. We aimed to identify additional susceptibility loci through the analysis of GWAS in the con-text of biological function. We bridge the gap between pathway and gene-oriented analyses of GWAS, byintroducing a pathway-driven gene stability-selection methodology that identifies potential causal genes inthe top-associated disease pathways that may be driving the pathway association signals. We analysedthe WTCCC and the NARAC studies of ∼5000 and ∼2000 subjects, respectively. We examined 700 pathwayscomprising ∼8000 genes. Ranking pathways by significance revealed that the NARAC top-ranked ∼6% laidwithin the top 10% of WTCCC. Gene selection on those pathways identified 58 genes in WTCCC and 61 inNARAC; 21 of those were common (Poverlap< 10221), of which 16 were novel discoveries. Among the identifiedgenes, we validated 10 known RA associations in WTCCC and 13 in NARAC, not discovered using single-SNPapproaches on the same data. Gene ontology functional enrichment analysis on the identified genes showedsignificant over-representation of signalling activity (P < 10229) in both studies. Our findings suggest a novelmodel of RA genetic predisposition, which involves cell-membrane receptors and genes in second messen-ger signalling systems, in addition to genes that regulate immune responses, which have been the focus ofinterest previously.

INTRODUCTION

Rheumatoid arthritis (RA) is the commonest inflammatoryjoint disease affecting over 1% of Caucasian populations.The disorder results in chronic inflammation of multiple per-ipheral joints leading to destruction of cartilage and bones,progressive deformity and severe disability. The aetiologyand pathogenesis of RA remains unknown (1,2), butimmunological factors clearly play a major role which issupported by the presence of inflammatory cells in the

synovium, the association of the disease with auto-antibodies, the cellular infiltration in inflamed tissuesand the response of patients to immunomodulatorytherapy (3). However, a number of clinical featuressuggest involvement of non-immune mechanisms: the pro-gressive increase of RA incidence over the age of 50, thepresence of subcutaneous nodules of proliferating cells atpressure points, the association of disease with osteopeniaand proliferative and destructive lesions in cartilage, bonesand other organs (1,2).

∗To whom correspondence should be addressed at: Centre for Epidemiology and Biostatistics, Imperial College London, St Mary’s campus,Norfolk Place, London W2 1PG, UK. Tel: +44 2075941942; Fax: +44 2075943984; Email: [email protected]†Both authors contributed equally to the work.

# The Author 2011. Published by Oxford University Press. All rights reserved.For Permissions, please email: [email protected]

Human Molecular Genetics, 2011, Vol. 20, No. 17 3494–3506doi:10.1093/hmg/ddr248Advance Access published on June 8, 2011

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 2: Pathway-driven gene stability selection of two rheumatoid arthritis

Over the past decade, research has focused on identifyingthe genetic basis of RA as a means to understand thebiology of the disease. A growing number of genes havebeen associated with RA in recent genome-wide associationstudies (GWAS) (4–6). However, the 31 loci confirmed todate (6), including the well-established associations of theHLA genes of the major histocompatibility complex (MHC),explain only 23% of the genetic risk to RA (7), and their func-tional implication to disease pathogenesis is yet to be fullyexplained.

There is much debate about the source of this ‘missingheritability’ (8), but one possibility is that RA, as any othercommon complex disease, arises on the basis of the cumulat-ive effect of many variants yet to be discovered, because theyare either rare (with potentially large effect) or have asmall effect (but are potentially common). Such variants arechallenging to identify when performing standard single-SNPanalyses of GWAS data. Increasing the sample size viameta-analysis has been an effective way of detectingadditional variants; however, this strategy is ultimatelylimited by the number of genotyped cases in each study.

We have previously hypothesized that multiple variantswithin a gene or a biological pathway may act additively todetermine disease susceptibility, and have reported that wecould detect this cumulative effect by analysing GWAS atthe level of multiple SNPs within biological pathways (9).Indeed, various methods have been proposed to performsuch joint analysis using combination statistics (10), an adap-tive rank-truncated product (11), a multivariate score test(Hotelling’s test) (12), Fisher’s method for combiningP-values and the minimum P-value approach (13), aFourier-transform-based approach (14), genetic similaritymatrices (15) and principal component analysis (16,17). Gene-centric approaches (18–20) summarize the genetic variationof a gene, but they do not model the effect of multiple func-tionally interacting genes, which results in limited power(9,21). Pathway-driven approaches (6,22–24) are more power-ful, as they take into account the biological interplay of genesand they are attractive in that they provide insight into howmultiple genes might contribute to disease biology. Neverthe-less, the vast majority of these pathway methods do notaccount for the overlap of genes in multiple pathways, aswhere there are causal genes lying in multiple pathwaysthese pathways will show somewhat misleading association(25). Moreover, these approaches do not identify the causalgenes within pathways (6,22–24,26–28), which limits theirbiological interpretability.

In this report, we go beyond a simple pathway-based analy-sis of GWAS to identify the genes within associated pathwaysthat are predominantly responsible for the pathway associationsignal. To achieve this, we recast the question of how toassociate genes with a disease as a pathway-driven stability-selection problem. Stability selection is a technique that isbased on sub-sampling (of individuals) in combination withvariable selection algorithms (29). Variable selectionmethods attempt to identify combinations of variables (inthis case SNPs) that best explain the outcome (diseasestatus). Regularization parameters control the number of vari-ables selected: smaller models (stronger regularization) resultin a lower false discovery rate (FDR) but may miss

real-associated variables; larger models (weaker regulariz-ation) detect more real-associated variables but also admitmore false positives. Stability selection (29) involves variableselection on multiple random sub-samples of the data (in thiscase individuals), and across multiple regularization parametersettings (29). Variables which are consistently chosen arereported as ‘stably selected’.

RESULTS

We applied our approach on two RA case–control GWAS; theWellcome Trust Case Control Consortium (WTCCC) RAstudy, of 1860 cases and 2938 controls typed on the Affyme-trix 500 K chip (4) and the North American RheumatoidArthritis Consortium (NARAC) study of 908 cases and 1260controls genotyped on the Illumina Humanhap550 (5). Adiagram of our suggested strategy is given in Figure 1.

To identify the biological pathways with the genes respon-sible for RA susceptibility, we examined 700 curated path-ways, as assembled from KEGG, BioCarta and GenMAPP.We used our previously reported cumulative trend (CT) teststatistic (9) to examine the association between RA and thecumulative genetic variation of each biological pathway. Wecalculated the x2

1-scores from the CT P-values (SupplementaryMaterial, Fig. S1) and found that the correlation between thetwo studies was highly significant (Spearman’s r2¼ 0.69,P , 2.2 × 10216). This correlation persisted even afterremoval of pathways that included strong and confirmedassociated loci with RA, such as the HLA genes within theMHC region (Spearman’s r2¼ 0.32, P , 2.2 × 10216). Pair-wise correlation of pathway statistics between studies (whichwere carried out in independent populations and on differentgenotyping platforms) should indicate common genetic vari-ation associated with RA. Ranking of the pathways revealeda 95% overlap in the top-ranked pathways between the twostudies at a 2% rank cut-off (14 pathways), comprising path-ways which are dominated by the known MHC RA associ-ations; at the same time a 70% overlap was observed at a6% cut-off (42 pathways, Supplementary Material, Fig. S2a).To help choose a threshold for which significant pathways totake forward, we plotted the maximum rank in one study asa function of the rank threshold in the other study (Supplemen-tary Material, Fig. S2b and c) and we observed that themaximum WTCCC rank undergoes a step change from 10 to50% at this 6% threshold. Conversely, the maximumNARAC rank makes a similar transition at a lower 2%threshold, indicating that the WTCCC pathway ranks maybe less reliably reproducible in NARAC. We thus decidedthat taking forward the NARAC top-ranked 6%, comprising42 pathways, provided a good balance between introducingnoise and increasing the computational demands of the vari-able selection step, and potentially missing genuine-association signals. The 42 associated pathways with the CTP-values and their ranks in both studies are listed in Table 1and the database from which each pathway came is shownin Supplementary Material, Table S8.

A gene composition similarity matrix, as shown in Sup-plementary Material, Figure S3, revealed that many of the sig-nificant pathways share the same genes. We therefore

Human Molecular Genetics, 2011, Vol. 20, No. 17 3495

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 3: Pathway-driven gene stability selection of two rheumatoid arthritis

speculated that there is an underlying group of genes acting inthe associated pathways, which may be driving this correlationin pathway significance. We selected the 2156 genes compris-ing the 42 pathways significant in both studies to carry forwardto our gene stability selection. Stability selection was carriedout using the HyperLasso algorithm which implements vari-able selection in logistic regression (21). The penalty par-ameter f ′ of the HyperLasso controls the size of the fittedmodels. We defined a grid of penalty values f′xL and for

each value of f ′xL we generated 1000 random sub-sampleseach consisting of 50% of the individuals, sampled withoutreplacement, and generated a set of selected SNPs for eachsub-sample using HyperLasso (21). We then calculated selec-tion probabilities for each gene by mapping the selected SNPsto their corresponding genes. In order to avoid spuriouslyselecting large genes, we split genes into LD based gene-blocks. All subsequent analysis is carried out at the level ofgene blocks. The selection probability p(g|f ′) of a gene-block

Figure 1. Methodological pipeline.

3496 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 4: Pathway-driven gene stability selection of two rheumatoid arthritis

g at a value of f′ was defined as the fraction of sub-samples inwhich at least one SNP of the gene-block g was selected. Thestability path of a gene-block g was defined as the line that fitsthe selection probability over the range of all penalty valuesf′xL. Full description of the steps are provided in Materialsand Methods.

We broadly observed that stability paths with consistentlyhigh p(g|f′) across f′xL correspond to the major hits foundin single SNP analyses in each study; for instance TRAF1/C5, HLA-DRB1 and MICA in NARAC; IL2RA, PTPN22,HLA-DRA and MICA in WTCCC. At the other extreme, weexpect that stability paths with low p(g|f′) across f′xL com-prise non-associated gene-blocks. In between these twoextremes lie many stability paths that have low selection

probability under strong regularization and high selectionprobability under weak regularization.

In order to extract from this class the ‘stably selected’ vari-ables, Meinshausen and Buhlman (29) suggest choosing athreshold pthr, such that 0 , pthr, 1 and retaining everythingwith p(g|f′) . pthr , for some small f′xL. However, thisfocuses only on a single point of the stability path. We proposeinstead to weight the selection probabilities along all points ofthe path according to a prior belief p(f′) as to the reliability ofinferences made at each f′xL and then retain all genes withweighted sum

∑f ′[l

p( f ′)p(g′) . pthr . The WTCCC study had

greater power as it was more than twice the size of NARAC,and so generated substantially larger models for the same value

Table 1. Top 42 pathways identified through pathway-based analysis in WTCCC (W) and NARAC (N)

Pathway definition P-value Rank Numberof genes

Number of typed SNPsN W N W Illumina 550k Affy 500k

Bystander B-cell activation 1.00E 2 371 1.00E 2 98 1 11 4 66 53Type I diabetes mellitus 1.00E 2 349 1.00E 2 133 2 2 41 846 745IL-5 signalling pathway 1.00E 2 344 1.00E 2 118 3 7 10 164 134Antigen-dependent B-cell activation 1.00E 2 334 1.00E 2 111 4 8 8 114 92Activation of Csk by cAMP-dependent protein

kinase inhibited signalling through the T-cell receptor1.00E 2 332 1.00E 2 123 5 5 20 246 250

T-cell receptor signalling 1.00E 2 331 1.00E 2 211 6 1 264 5215 4613Cell adhesion molecules 1.00E 2 289 1.00E 2 68 7 15 127 4159 3811The role of eosinophils in the chemokine network of allergy 1.00E 2 288 1.00E 2 71 8 14 8 66 76Cytokines and inflammatory response 1.00E 2 283 1.00E 2 108 9 9 28 285 258B lymphocyte cell surface molecules 1.00E 2 276 1.00E 2 104 10 10 10 153 125The Co-stimulatory signal during T-cell activation 1.00E 2 275 1.00E 2 130 11 3 17 227 186Th1/Th2 differentiation 1.00E 2 270 1.00E 2 91 12 12 17 244 256Lck and Fyn tyrosine kinases in initiation of TCR activation 1.00E 2 254 1.00E 2 122 13 6 10 153 134Antigen processing and presentation 1.00E 2 243 1.00E 2 85 14 13 71 615 549Haematopoietic cell lineage 1.00E 2 238 1.00E 2 128 15 4 81 1132 1022Notch signalling pathway 1.00E 2 74 7.90E 2 06 16 62 43 826 606Proteasome complex 1.00E 2 43 1.91E 2 05 17 64 17 142 119Complement pathway 1.00E 2 41 5.31E 2 06 18 59 16 269 242Classical complement pathway 1.00E 2 37 4.56E 2 06 19 56 13 199 190Natural killer cell-mediated cytotoxicity 1.00E 2 35 9.28E 2 05 20 70 125 2122 1775Focal adhesion 1.00E 2 31 1.00E 2 48 21 16 195 5026 4586Cytokine–cytokine receptor interaction 1.00E 2 29 1.00E 2 42 22 17 249 2956 2854Calcium signalling pathway 1.00E 2 28 5.51E 2 12 23 37 168 5604 5109MAPK signalling 1.00E 2 27 1.00E 2 38 24 18 252 5335 4848Glycerophospholipid metabolism 1.00E 2 25 4.63E 2 06 25 57 49 988 838Glycerolipid metabolism 1.00E 2 24 7.03E 2 06 26 61 45 943 794Neuroactive ligand receptor interaction 1.00E 2 22 1.00E 2 37 27 19 234 5642 5323Toll like receptor 9 signalling via IRF5 1.00E 2 21 1.60E 2 06 28 53 17 169 184Smooth muscle contraction 1.00E 2 18 2.56E 2 06 29 54 138 4356 3871Phosphatidylinositol signalling system 1.00E 2 17 5.41E 2 10 30 43 85 2887 2569Tight junction 1.00E 2 17 5.92E 2 12 31 38 133 4125 3673Purine metabolism 1.00E 2 16 2.34E 2 15 32 29 143 3532 3165Gnrh signalling pathway 1.00E 2 15 5.34E 2 07 33 49 96 2714 2381Wnt signalling pathway 1.00E 2 14 1.00E 2 34 34 21 144 2830 2636Gap junction 1.56E 2 14 1.00E 2 22 35 25 92 2823 2582Valine leucine and isoleucine biosynthesis 1.77E 2 14 8.78E 2 05 36 69 12 171 183Cell communication 2.81E 2 14 8.83E 2 10 37 44 132 2304 2162Regulation of actin cytoskeleton 1.55E 2 13 1.00E 2 29 38 23 204 4206 3932Ecm receptor interaction 2.38E 2 13 1.36E 2 12 39 36 87 2541 2386Jak stat signalling pathway 1.30E 2 12 3.89E 2 15 40 30 149 1735 1699Taste transduction 1.70E 2 10 6.87E 2 07 41 50 49 1033 959Glycan structures biosynthesis 1 4.28E 2 08 4.81E 2 14 42 32 113 2985 2631

The cumulative trend (CT) statistic P-values are adjusted at a ¼ 0.05 with the Benjamini–Hochberg method that controls the FDR. Number of SNPs per pathwaymight differ between studies, because they refer to the typed markers. The pathways appear in the order of ascending NARAC ranks.

Human Molecular Genetics, 2011, Vol. 20, No. 17 3497

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 5: Pathway-driven gene stability selection of two rheumatoid arthritis

of f′. In order to equalize the power between these two data sets,we derived study-specific prior weights so as to equalize theexpected model size (E[size]), such that in the smaller studymore weight is given to the models fitted under looser valuesof f′ and vice-versa. (Full description of the methodology is pro-vided in Materials and Methods).

We calculated the number of selected gene-blocks whichoverlap between the two studies as well as the median FDRand 95% confidence interval (CI) among these commongenes (Fig. 2A). In addition, we sought to maximize thenumber of observed common gene-blocks between studies,while controlling the FDR (Fig. 2A). We conducted sensitivityanalysis around E[size], controlled by the prior distributionp(f′), and pthr (fully described in Materials and Methods). Agood trade-off was observed at E[size] ¼ 400 and pthr¼ 0.4,for which we selected 58 genes in WTCCC and 61 genes inNARAC (that is 60 gene-blocks in WTCCC and 64 gene-blocks in NARAC), of which 21 are common (P ¼ 10220)and the median FDR is 5% with 95% CI of [0–19%] asshown in Figure 2B. The selected gene-blocks inWTCCC and NARAC are shown in Supplementary Material,Tables S1 and S2, respectively. In order to investigate thepossible bias in favour of selecting gene-blocks which havea greater number of typed SNPs, we compared the distri-butions of number of typed SNPs (#SNPs) per gene-block

among selected gene-blocks and non-selected gene-blocks(however, still within pathways carried forward to variableselection) (Supplementary Material, Fig. S4). There is an over-representation of selected gene-blocks in the region20�,#SNPs�,70 (which shows a tendency to selectlarger gene-blocks) and otherwise an under-representation.To ensure that this bias does not invalidate our overlap stat-istics, we recalculated an overlap significance of P , 1024

based on sampling 10 000 times from the null distribution ofnumber of common gene blocks under the assumption of non-random selection of gene-blocks according to #SNPs (seeMaterials and Methods).

For the majority of genes, only one gene-block wasselected, but for ADCY2 and FHIT in WTCCC and FHITand PTPRN2 in NARAC, two/three LD blocks exhibitedstable selection. We considered a gene found in one study toreplicate in the other study, only if the same gene-block wasselected in both. Known associations comprise 10 of 58 inWTCCC; 13 of 61 in NARAC and 5 of 21 common genes.Of the 58 and 61 genes selected independently in NARACand WTCCC from an original pool of 2156 genes, morethan 50% were either common between studies or previousRA associations and hence can be considered as validated.The stability paths of the selected gene-blocks in each studyare shown in Figure 1C and D; the subset of stability paths

Figure 2. Genes selected by stability selection. (A) The solid lines show on the left y-axis the number of common genes selected in WTCCC and NARAC for differ-ent values of pthr (x-axis) and expected model sizes E[size] (depicted in different colors). Dashed lines show (1- median FDR) on the right y-axis and the 95% CI ofthe median is shown as error bars. (B) Fraction of selected genes in WTCCC and NARAC which are common for E[size] ¼ 400 at different values of pthr. Thenumber of selected genes per study (nWTCCC, nNARAC) is shown above the curves followed by the number of genes in common (r) and the hypergeometric probabilityof the observed overlap by chance. The median FDR and the 95% CI around it are given below the curve. The asterisk shows that for pthr ¼ 0.4, we achieve themaximum number of genes that overlap with a median FDR ¼ 5% [0,19%]. (C and D) Stability paths of the 60 and 64 gene-blocks selected in WTCCC (C) andNARAC (D), respectively. Gene names are shown in the legend followed by a number that indicates the selected gene-block.

3498 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 6: Pathway-driven gene stability selection of two rheumatoid arthritis

for previously known associations is shown in SupplementaryMaterial, Figure S5a and b and the stability paths of gene-blocks that were not selected are shown in SupplementaryMaterial, Figure S5c and d.

We then investigated whether common genes were drivingthe association signals of multiple pathways by mapping theselected genes back to their original pathways (Table 2).The striking majority of selected genes, to name a few: theHLA genes, ADCY2, ADCY8, CACNA1A, DGKH, FLNB,ITGA1, PRKCA, PRKCQ, SHC4, VAV3 were indeed involvedin more than one of the 42 pathways. On the other hand, someselected genes such as IRF5, MAPKAPK2, PDE10A, PDE1C,PSMD13 and TEC were only involved in one pathway(Table 2). We observed that the very top-ranked pathwayswere mostly driven by the HLA genes and six pathwayswere driven entirely by known associations, the HLA-DRA,HLA-DRB1 and CD28 genes in particular. The majority ofpathway signals (18 pathways) were driven by a combinationof previously reported associations as well as novel associ-ations. A further 14 pathways were driven only by ournewly reported associations and surprisingly four pathwayshad no gene selected.

Finally, we calculated the inverse variance weightedmeta-analysis P-value for the SNP selected most frequentlyin each study in the 21 commonly selected genes usingPLINK (30). The meta-analysis P-value was ,0.05 for allSNPs (Table 3).

In order to characterize the functional role of the identifiedgenes, we assigned the identified genes to their cellularlocation using Ingenuity Pathways Analysis (Fig. 3). Therewas a significant over-representation of genes located in thecell membrane (compared with the location of all annotatedhuman genes), all of which are involved in signalling fromthe cell exterior to the interior. We then performed a geneontology (GO) functional enrichment analysis using GOrilla(31) and we finally created a schematic map (Fig. 4) of howthese genes might act together in biological processes to deter-mine RA susceptibility.

In addition to the expected involvement of HLA-mediatedprocesses (GO:0032395, P ¼ 5.1 × 10214), we observed sig-nificant enrichment in multiple GO functional categories, (Sup-plementary Material, Tables S3 and S4), including membranereceptor-mediated signalling, integrin-mediated signalling(GO:0007229, P ¼ 1.53 × 1025), cytokine receptor(GO:0004896, P ¼ 1.72 × 1028) and growth factor binding(GO:0019838, P ¼ 1.09 × 1027), as well as adenylate cyclaseactivity (GO:0004016, P ¼ 4.7 × 1025), 3′ 5′ cyclic nucleotidephosphodiesterase activity (GO:0004114, P ¼ 9.05 × 1026),activation of protein kinase activity (GO:0033674, P ¼ 5.3 ×1029) and voltage-gated channel activity (GO:0005245, P ¼1.11 × 10210), which contain genes involved in intracellularsignalling through second messenger systems (SMSs) (32,33).The SMSs are conserved, ubiquitous systems that relaysignals from cell-surface receptors to target molecules in thecytosol or nucleus, and greatly amplify the strength of thesignal, through release of the second messengers: cyclic adeno-sine monophosphate (cAMP); inositol triphosphate (ITP)/diacylglycerol (DAG) and calcium ions (Ca2+) (32–34).

Adenylate cyclase genes (ADCY2, ADCY8, ADCY9) syn-thesize cAMP that activates multiple protein kinases (Fig. 4).

The activity of cAMP is then terminated by the phosphodiester-ase (PDE) enzymes, of which several genes (PDE10A, PDE4B,PDE4D) were identified in either WTCCC or NARAC data setssuggesting control of cAMP (GO:0004016, P ¼ 4.7 × 1025,GO:0004114, P ¼ 9.05 × 1026) as a potential key regulationpoint in RA pathophysiology.

The ITP/DAG system is activated by phospholipase C(PLC), which hydrolyses membrane phospholipids releasingDAG (Fig. 4). DAG recruits calcium-dependent kinases thatactivate numerous pathways (GO:0045859, P ¼ 6.98 ×1029) including RAS, ERK and NFkB signalling, interactingwith the p38, NFAT and STAT signalling pathways (32).The actions of DAG are terminated by the enzyme diacylgly-cerol kinase (DGK). Protein kinase C (PRKC) requirescalcium ions generated from the action of PLC on thecomplex of inositol 1,4,5-trisphosphate (ITP) and DAG. PLCreleases ITP which diffuses through the cytosol and binds toITP receptors (ITPR) on the endoplasmic reticulumcausing the release of Ca2+ into the cytosol (GO:0042325,P ¼ 2.99 × 10210). The identification of several genes inthis system (DGKH, DGKI, PLCG2, PRKCA, PRKCQ,ITPR3) suggest the ITP/DAG signalling is another key regu-latory step in RA (GO:00045859, P ¼ 6.98 × 1029).

A further enriched SMS was the calcium SMS, which is basedon cytosolic Ca2+ triggering numerous cellular processes(34,35). (GO:0005262, P ¼ 1.64 × 1028; GO:0005245, P ¼1.11 × 10211, genes CACNA1A, CACNA1D, CACNA1E,CACNA2D1, CACNA2D3, CACNB2, CACNG5, ITPR3,VAV3). In addition to the release of Ca2+ from the endoplas-mic reticulum, the entrance of Ca2+ into the cytosol from theexterior is also controlled by voltage-gated channels, receptor-operated channels (GO:007166, P ¼ 7.08 × 10223) andG-protein-coupled receptors, with an increase in cytosolicCa2+ activating numerous calcium-dependent proteins inclu-ding PRKC (Fig. 4).

Activation of protein kinase activity (GO:0032147,P ¼ 2.7 × 1027) and activation of phosphorylation(GO:0042325, P ¼ 2.99 × 10210; GO:0001932, P ¼ 8.47 ×10210) were also enriched functional categories. The multipleidentified genes coding for kinases and phosphatases(PRKCA, PRKCQ, PIK3R3, PPP3CA, PTPN22, PTPN5,MAPKAPK2, MAP3K4) are activated by the SMSs (Fig. 4)and regulate multiple cellular functions including cell turnover,metabolism, secretion and growth (33,36,37).

Many of the remaining genes identified in either study thatare not directly involved in receptor-mediated signalling orSMSs are also known to regulate these systems (Fig. 4). Theproteins encoded by VAV3, ALCAM, CD59, ITGB3 interactwith those of EGFR, which interacts with GRB2, leading toRAS signal transduction. EGFR encoded protein also interactswith the SPRY1 protein which inhibits growth factor signal-ling. FGF5 and FGF6 are ligands for FGFR2 which interactswith GRB2. In terms of cytokines and their receptors, theencoded proteins of TNFAIP3, IRF5, SOCS1, SOCS6,STAT4 and PIAS2 are well-known downstream regulators ofcytokine-induced inflammation, as are the encoded proteinsof RAPGEF4, RAPGEF5, RASGRF2, PRKCQ and NFATC4.The biological interactions of all significant genes in bothNARAC and WTCCC studies are shown schematically inFigure 4 as a functional cellular map.

Human Molecular Genetics, 2011, Vol. 20, No. 17 3499

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 7: Pathway-driven gene stability selection of two rheumatoid arthritis

Table 2. Mapping of the union of WTCCC and NARAC selected genes back to the 42 associated pathways

Pathway Number of SG /Number of PG

Selected genes

Bystander B-cell activation 3/8 CD28, HLA-DRA, HLA-DRB1Type I diabetes mellitus 12/45 CD28, HLA-C, HLA-DOB, HLA-DPB1, HLA-DQA1, HLA-DQA2,

HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-G, PTPRN2IL 5 signalling pathway 3/10 HLA-DRA, HLA-DRB1, IL5RAAntigen-dependent B-cell activation 3/12 CD28, HLA-DRA, HLA-DRB1Activation of Csk by cAMP-dependent protein kinase

inhibited signalling via the T-cell receptor2/24 HLA-DRA, HLA-DRB1

T-cell signalling 29/273 CD28, CD247, VAV3, GRAP2∗, TEC∗, ITPR3, DGKH, DGKI∗, PPP3CA,GRB2, RASGRF2, PRKCA, PTPN5, NFATC4, PIK3R3, TNFAIP3,PRKCQ, CDK4, IL6R, IL2RA, IL2RB, IL5RA, CSF2RB, PIAS2, SOCS1,SPRED2, SPRY1, PTPN22∗, RAP1A (VAV3)

Cell-adhesion molecules 19/134 ALCAM∗, CD28, CD40, CLDN10, CLDN23, HLA-C, HLA-DOB,HLA-DPB1, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRA,HLA-DRB1, HLA-G, ICAM3∗, ICAM4, ICOSLG∗, ITGA4, ITGA9

Eosinophils in the chemokine network of allergy 2/8 HLA-DRA, HLA-DRB1Cytokines and inflammatory response 2/29 HLA-DRA, HLA-DRB1B lymphocyte cell surface molecules 311 ICAM1, HLA-DRA, HLA-DRB1The Co-stimulatory signal during T-cell activation 4/21 CD28, GRB2, HLA-DRA, HLA-DRB1Th1/Th2 differentiation 5/19 CD28, HLA-DRA, HLA-DRB1, IL2RA, IL18R1(IL1RL1)Lck and Fyn tyrosine kinases in initiation of TCR activation 2/13 HLA-DRA, HLA-DRB1Antigen processing and presentation 15/83 HLA-C, HLA-DOB, HLA-DPB1, HLA-DQA1, HLA-DQA2, HLA-DQB1,

HLA-DRA, HLA-DRB1, HLA-G, TAP1∗, TAP2∗, HSPA1A∗

(HSPA1B,HSPA1L), PSMB3, PSMD13∗

Haematopoietic cell lineage 10/88 CD59∗, CSF1R, HLA-DRA, HLA-DRB1, IL2RA, IL5RA, IL6R, ITGA1,IGA4, ITGB3

Notch signalling pathway 0/47 ‘-Proteasome complex 1/17 PSMB3Complement pathway 2/19 C5, C7∗

Classical complement pathway 2/14 C5, C6∗

Natural killer cell-mediated cytotoxicity 15/132 CD247, GRB2, HLA-C, HLA-G, ICAM1, IFNAR2, MICA, MICB, NFATC4,PIK3R3, PLCG2, PPP3CA, PRKCA, SH3BP2∗, SHC4, VAV3

Focal adhesion 13/200 EGFR, FLNB, GRB2, ITGA1, ITGA4, ITGA9, ITGB3, PDGFB, PIK3P3,PRKCA, RAP1A (VAV3), SHC4, VAV3

Cytokine–cytokine receptor interaction 11/257 CD40, CSF1R, CSF2RB, EGFR, IFNAR2, IL2RA, IL2RB, IL5RA, IL6R,PDGFB, IL18R1 (IL1RL1)

Calcium-signalling pathway 12/174 ADCY2, ADCY8, ADCY9, CACNA1A, CACNA1D, CACNA1E, EGFR,ITPR3, PDE1C, PLCG2, PPP3CA, PRKCA

MAPK signalling 25/257 CACNA1A, CACNA1D, CACNA1E, CACNA2D1∗, CACNA2D3∗,CACNB2∗, CACNG5∗, EGFR, FGF14, FGF5, FGF6, FGFR2, FLNB,GRB2, MAP3K4, MAPKAPK2∗, MEF2C∗, NFATC4, PDGFB, PPP3CA,PRKCA, PTPN5, RASGRF2, RAP1A (VAV3)

Glycerophospholipid_metabolism 2/51 DGKH, PLCG2Glycerolipid_metabolism 1/46 DGKHNeuroactive ligand-receptor interaction 0/254 –TLR9 signalling via IRF5 4/17 IL1RL1, SOCS1, TNFAIP3, IRF5Smooth muscle contraction 9/150 ADCY2, ADCY8, ADCY9, ITPR3, PDE4B, PDE4D, PLCG2, PRKCA,

PRKCQPhosphatidylinositol signalling system 4/90 DGKH, PLCG2, PRKCA, PRKCQTight junction 5/136 CDK4∗, CLDN10∗, CLDN23∗, PRKCA, PRKCQPurine metabolism 11/145 ADCY2, ADCY8, ADCY9, ADSS∗, FHIT∗, PAPSS2∗, PDE10A∗, PDE1C,

PDE4B, PDE4D, POLD2Gnrh signalling pathway 9/97 ADCY2, ADCY8, ADCY9, CACNA1D, EGFR, GRB2, ITPR3, MAP3K4,

PRKCAWnt signalling pathway 3/149 NFATC4, PPP3CA, PRKCAGap junction 8/98 ADCY2, ADCY8, ADCY9, EGFR, GRB2, ITPR3, PDGFB, PRKCAValine leucine and isoleucine biosynthesis 1/12 VARS (MICB)Cell communication 0/138 –Regulation of actin cytoskeleton 12/212 EGFR, FGF14, FGF5, FGF6, FGFR2, ITGA1, ITGA4, ITGA9, ITGB3,

PDGFB, PIK3R3, VAV3Ecm receptor interaction 4/87 ITGA1, ITGA4,ITGA9,ITGB3Jak stat signalling pathway 15/153 CSF1R∗, CSF2RB∗, GRB2, IFNAR2, IL2RA, IL2RB, IL5RA, IL6R, PIAS2,

PIK3R3, SOCS1, SOCS6∗, SPRED2, SPRY1, STAT4Taste transduction 3/53 ADCY8, CACNA1A, ITPR3

Continued

3500 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 8: Pathway-driven gene stability selection of two rheumatoid arthritis

The GO enrichment functional analysis, cell-locationmapping and the cellular map we created demonstratefurther concordance between the findings in the two studiesat the functional level. Other than the 21 genes that were ident-ified in both WTCCC and NARAC studies, many of the genes,which were only selected in one study, had functional rolessimilar to genes selected in the other study.

DISCUSSION

We have shown that using both biologically driven pathwayinformation and a robust gene stability-selection methodologyyields improved power and consistent results across two inde-pendent RA cohorts. We have shown that this approach canidentify novel, reproducible associations from GWAS aswell as replicate known associations that were not validatedusing single-SNP approaches using the same GWAS data.For example, the original NARAC report (5), presented the

TRAF1-C5 association and confirmed only the MHC regionand the PTPN22 gene, whereas our approach on NARAC vali-dated the CD247, CD28, CD40, IL2RB, PRKCQ, SPRED2 andTAP2 genes, which have been previously reported as modestassociations with disease (6). Similarly, in comparison to theoriginal WTCCC report (4) that only confirmed the associationof the MHC region and the PTPN22 gene, we also validatedthe associations of STAT4, TNFAIP3, IL2RA, IL2RB, IRF5and IL6R (6). Furthermore, of the seven novel loci discoveredin the recent meta-analysis of RA GWAS on a total of 41 282samples (6), we discovered three (IL2RA, IRF5, SPRED2), andof the 10 suggestive loci, we discovered two (CD247, IL6R)using only two studies with 4798 (WTCCC) and 2168(NARAC) samples each.

Our approach would also be applicable if only one GWAScohort was available, although the calibration of the selectionprobability pthr and expected model size to maximize thenumber of discoveries and minimize the FDR would not bepossible, and we would suggest using the values derived in

Table 2. Continued

Pathway Number of SG /Number of PG

Selected genes

Glycan structures biosynthesis 1 0/117 –

Pathways are in the same order as they appear in Table 1. Column 2 shows the number of selected genes (SG) over the number of pathway genes (PG). Previouslyreported RA associations are underlined. Genes that appear in only 1 of the 42 pathways are followed by an asterisk (∗). Genes without an asterisk act in more than1 of the 42 pathways. Pairs of genes in italics indicate genes that overlap the same selected gene-block, but are in different pathways. The first listed gene in eachpair is from the pathway indicated, while the second listed has a higher selection probability and hence is reported in our final list.

Table 3. The 21 identified genes that replicated in both WTCCC and NARAC and their most selected SNP

Gene Chromosome Start pos End pos Most selectedSNP in block

Allele P-value(meta-analysis)

OR (95% CI)

HLA-DRA 6 32495624 32540802 rs9268853 C 4.56E-109 2.40 (2.20 - 2.60)HLA-DQA1 6 32693160 32739407 rs9272219 T 9.58E-46 0.52 (0.48–0.57)MICA 6 31459349 31511069 rs1063635 A 1.42E-17 0.74 (0.69–0.79)HLA-G 6 29882734 29926878 rs1610677 C 4.12E-15 0.76 (0.71–0.81)IL2RB 22 35831016 35872708 rs743777 G 2.09E-06 1.19 (1.10–1.30)HLA-C 6 31324507 31367834 rs6906846 A 2.26E-05 0.85 (0.79–0.92)EGFR 7 55033429 55128489 rs17172446 A 0.0001 0.86 (0.79–0.93)TEC 4 47812556 47986571 rs2457414 C 0.0016 0.89 (0.83–0.96)PTPN22a 1 114137960 114235857 rs2488457 G 0.0020 1.13 (1.10–1.20)PDE4D 5 58280444 58658855 rs829451 T 0.0030 1.12 (1.10–1.20)SHC4 15 46882243 46992983 rs16961743 T 0.0050 1.40 (1.10–1.80)CACNA1D 3 53474897 53575012 rs3774425 A 0.0078 0.90 (0.85–0.98)CACNA1A 19 13155623 13331522 rs3774425 A 0.0084 0.90 (0.83–0.97)VAV3 1 107894802 108046933 rs10494071 T 0.0091 1.19 (1.00–1.37)CACNB2 10 18578665 18827827 rs1757217 T 0.0132 1.15 (1.00–1.21)PDE1C 7 31775674 31790442 rs992185 A 0.0163 0.80 (0.67–0.96)FGF14 13 101150762 101335129 rs1537930 C 0.0167 1.10 (1.00–1.21)ITGA9 3 37448558 37577446 rs6788347 G 0.0196 1.08 (1.00–1.21)ADCY2 5 7594566 7712781 rs7717535 T 0.0200 0.92 (0.86–0.99)CACNA2D1 7 81573468 81851660 rs38564 T 0.0274 0.92 (0.85–0.99)PDE4B 1 66010515 66545937 rs17128076 A 0.0376 1.23 (1.00–1.50)

HLA-DRA, HLA-DQA1, PTPN22, IL2RB and MICA have shown association in previous studies. The remaining 16 are novel associations cross-validating in bothcohorts. The ORs were calculated with respect to the reported allele. The genes appear in the ascending order of their meta-analysis P-value.aOriginal analysis on NARAC study identified rs2476601 (single SNP P ¼ 2 × 10211) as the most significant SNP mapped to PTPN22 gene. We did not replicatethis result as this SNP lies �68 kb away from the PTPN22 boundaries and we only mapped SNPs to genes within 20 kb of each side of the gene.

Human Molecular Genetics, 2011, Vol. 20, No. 17 3501

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 9: Pathway-driven gene stability selection of two rheumatoid arthritis

this paper. If more than two GWAS cohorts were available, itwould be straightforward to extend our approach using amultivariate hypergeometric distribution to assess the FDR.

Our results suggest that our pathway-based and genestability-selection analysis is robust to genetic heterogeneitybetween different samples and to the technical heterogeneityof different genotyping platforms. We have replicated our find-ings at the pathway level by identifying the same pathways inboth studies lying in the top 6% of the rankings in NARACand at the top 10% in WTCCC; at the gene level by identifying21 genes in common of the 58 and 61 selected in WTCCC andNARAC studies, respectively, and by validating 8 (WTCCC)and 12 (NARAC) known associations that were not validatedusing single-SNP approaches on the same GWAS data; at the

functional level through demonstrating that many genesselected in only one study have functionally similar genesselected in the other study. A recent review (38) commentedthat, at present, there is no clear way to compare variouspathway analysis methods against each other, but there is aneed to replicate association results. To our knowledge, this isthe first combined pathway and gene-based methodology thathas replicated association findings in two real GWAS data sets.

Finally, our functional analysis of the identified genessuggests that the underlying concept of RA as a largely auto-immune and inflammatory disease should be broadened. Ourfindings suggest that defective second messenger signallingthrough cAMP, ITP, DAG and Ca2+ may be at the heart ofRA predisposition. The SMSs are used by all cells to switch

Figure 3. Cellular location of products of the selected genes. Genes in red were selected in both WTCCC and NARAC; in yellow are genes selected only inNARAC; in blue are genes selected only in WTCCC. Pink indicates that the same gene was identified in both studies but a different gene-block in each study.Genes coding for extracellular proteins are predominantly growth factors. Genes coding for intracellular proteins are predominantly involved with signal trans-duction. Lines connecting genes indicate direct interactions as defined from Ingenuity Pathways Analysis knowledge base.

3502 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 10: Pathway-driven gene stability selection of two rheumatoid arthritis

on and off multiple processes including cell cycle, differen-tiation, growth, secretion and immune response (33,34).SMSs regulate both the intensity and duration of cellularresponses and the identification of multiple genetic variantsin the SMSs suggests that RA may arise from a defect in theon/off switch controlling both the amplitude and duration ofkey cellular functions. The SMSs control not only immuneresponse cells but chondrocytes, osteocytes, connectivetissue and blood vessel cells, as well as processes such asosteoclast resorption, differentiation of chondrocyte and osteo-blasts and repair and removal of senescent cells (39,40). Thusa defect in on/off signalling may explain the widespread cellu-lar and connective tissue disorder in RA in addition to abnor-mal function of inflammatory cells.

It is of interest that numerous cellular studies have docu-mented alterations in cAMP, inositol phosphate and calciumsignalling in RA, including abnormalities in these pathwaysin inflammatory cells in the joints, in in vivo studies of osteo-cytes and chondrocytes (40–43), supporting our hypothesis.Future studies should explore the relationship between genesidentified and functional cellular responses to test this hypoth-esis, particularly as all steps in the processes we have impli-cated may be amenable to pharmacological manipulation.Our identification of the SMSs as playing a key role in RAsusceptibility suggests a novel concept of the genetic mechan-isms of the disease which potentially opens new avenuesfor pharmacological intervention to prevent and treat the dis-order.

Figure 4. Cellular and functional map of non-HLA genes identified by stability selection. Purple—selected in both WTCCC and NARAC; blue—selected inWTCCC only; red—selected in NARAC only. Focus is given on the second messenger signalling systems (SMSs). In the cyclic AMP system (cAMP), adenylatecyclase (ADCY) synthesizes cAMP from ATP. ADCY is activated by a range of receptors including G-protein coupled receptors. cAMP activates protein kinases(PKA) that phosphorylate target proteins. The enzymes phosphodiesterase (PDE) terminate the activity of cyclic AMP switching off the process. The inositoltriphosphate/diacylglycerol (ITP/DAG) system is activated by phospholipase C (PLC), which hydrolyses membrane phospholipids releasing DAG and ITP. DAGrecruits calcium-dependent protein kinase C and D (PKC, PKD). These activate numerous pathways including RAS and ERK leading to cell turnover, IKK andactivation of NFKB signalling and interaction with p38 MAP kinase, NFAT and STAT signalling pathways. The actions of DAG are terminated by the enzymediacylglycerol kinase, which converts DAG to phosphatidic acid. PKC requires calcium ions (Ca2+) and these are made available by the action of ITP in thecalcium-signalling pathway. ITP, released from the complex with DAG by PLC, diffuses through the cytosol and binds to inositol triphosphate receptors (ITPR)on the endoplasmic reticulum causing the release of Ca2+ into the cytosol. The calcium second messenger system is activated by release of ITP, which binds tothe ITPR on the reticuloendothelial system triggering release of Ca2+ from the ER into the cytosol. Entrance of Ca2+ into the cytosol is also controlled byvoltage-gated channels, such as CACNAC, ITPR or receptor-operated channels and G-protein-coupled receptors. The rise in Ca2+ concentrations in thecytosol, as a result of influx from the endoplasmic reticulum or extracellular fluids through calcium channels, activates numerous calcium-dependent proteinsincluding PRKC. The actions of Ca2+ are terminated by active transport of Ca2+ out of the cytosol using calcium ATPases or sodium calcium exchanges.

Human Molecular Genetics, 2011, Vol. 20, No. 17 3503

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 11: Pathway-driven gene stability selection of two rheumatoid arthritis

MATERIALS AND METHODS

The flowchart of our proposed methodology is given inFigure 1.

Pathway-based analysis

The CT test statistic (9) assesses the null hypothesis that thepathway does not comprise any genes that are associatedwith the disease. In our original publication (9), the null distri-bution is obtained via fitting a skew normal distribution to1000 CT statistics under permuted phenotype labels, usingthe ‘sn’ package in R. In this work, we observed that the‘sn’ package does not allow calculation of the lower tail ofthe cumulative distribution in log space, which enforces alower bound on the reported significance levels. To addressthis problem, we calculated P-values by fitting a gamma dis-tribution. We used the Kolmogorov–Smirnof test statistic, aswell as visual inspection of QQ plots to determine that thegamma distribution provides as good a fit as the skew-normal.Correction for multiple testing and dependence between path-ways due to shared genes was performed via the method ofBenjamini–Hochberg–Yekutieli, which controls the FDRunder dependency. We set the FDR at a ¼ 0.05. This pro-cedure was applied independently to WTCCC and NARAC.

Gene stability selection

We selected the 2156 genes comprising the 42 pathways sig-nificant in both studies to carry forward to gene-based stabilityselection. SNPs were assigned to genes in a physical distanceof 20 kb for each side of the gene. It has been suggested(27,44) that choosing 20 kb is not entirely arbitrary since arecent study of gene expression showed that the majority ofeQTLs lie within 20 kb of genes (45). Stability selectionrequires a variable selection algorithm with a regularizationparameter which controls the number of variables selected.In the original implementation of stability selection, theLasso variable selection was used (29), these authors thenshowed that a ‘randomized Lasso’ algorithm gave better selec-tion properties. We used the HyperLasso variable selectionalgorithm (21), which has similar properties with the random-ized Lasso [Phil Brown and Jim Griffin contribution to the dis-cussion of Meinshausen and Puhlman (29) paper and theauthor response]. The HyperLasso has two parameters, shapel and penalty f′, the regularization parameter. We fixed theshape parameter at l ¼ 1 (data not shown). We defined agrid of penalty values f′xL,(0,45), in increments of 3. Foreach value of f′xL, we generated 1000 random sub-sampleseach consisting of 50% of the individuals, sampled withoutreplacement; the case–control ration remained constantacross random sub-samples and we generated a set of selectedSNPs for each sub-sample using HyperLasso (21). The upperand lower bounds were chosen to fit models of plausible size(between 1 and �2000 covariates).

We then calculated selection probabilities for each gene. Inorder to avoid spuriously selecting large genes, we first splitgenes into LD based gene-blocks on the basis of recombina-tion rates estimated using LDHat (46) on HapMap Phase IIdata. We assigned every position with recombination rate

. ¼ 15 cM/Mb as the border of an LD region (correspondingto LD-regions with average size 35 kb). The selectionprobability of a gene-block p(g|f′) is defined as the fractionof sub-samples in which at least one SNP in a gene-block gwas selected. The stability path of a gene-block is definedas the selection probability over the range of all penaltyvalues f′xL.

We assume a gamma prior probability for the penalty withshape parameter k and scale parameter u,

G( f ′|k, u) = f ′k−1 1

G(k)uke−f ′/q (1)

with support truncated to lie within the lower and upper boundof the grid L. Therefore the expected probability of inclusionof a gene g, is given by the integral

p(g|k, u) =∫p(g′)G( f ′|k, u)df ′ (2)

which we approximate by

p(g|k, u) ≈

∑f ′[L

p(g′)G(1/f ′|k, u)D f ′

∑f ′[L

G(1/f ′|k, u)D f ′(3)

where Df ′ is the size of the intervals over which we approxi-mate the continuous integrands by their values at f′. In orderto define the parameters of the gamma prior, we notice thatthe model size can be approximated by an exponentialfunction of f′ (Supplementary Material, Fig. S6), and fit themodel

log(size( f ′)) = A + Bf ′ (4)

The expected posterior model size is given by

E[size] =∫1

0

eAeBf ′G( f ′|k, u)df ′

= 1

1 − Bu

( )k

eA

( )∗∫1

0

G f ′|k, u

1 − Bu

( )df ′,

= 1

1 − Bu

( )k

eA

( )(5)

which leads to an expression for the shape parameter k interms of the scale parameter u and the expected model sizeE[size]

k = log(E[size]/eA)log(1/(1 − Bu)) (6)

Furthermore, we sought to equalize the spread of the posteriordistribution of the model size by setting the scale parameter ofthe integrand in Eq. (5) to be the same in both studies, which

3504 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 12: Pathway-driven gene stability selection of two rheumatoid arthritis

leads to an expression of the scale parameter of one study interms of the other.

uWTCCC = uNARAC

1 − uNARAC∗(BNARAC − BWTCCC) (7)

Subsequently, the stably selected gene-blocks are defined asthose with p(g|k,u). ¼ pthr. We calculated results for arange of E[size]x[100,1000] (increment of 100) and pthr

x[0.01,0.9] (increments of 0.1).Throughout we set the mean of the NARAC gamma prior dis-

tribution to be uNARAC / kNARAC ¼ 10, from which for eachE[size] we could calculate uNARAC and kNARAC using Eq. (6),subsequently calculate uWTCCC from Eq. (7) and then kWTCCC

again from Eq. (6). The corresponding prior distributions foreach study and for different expected sizes E[size] are shown inSupplementary Material, Figure S7. The total number of selectedgenes per study for various values of E[size] is shown in Sup-plementary Material, Figure S8. We calculated the number ofselected genes which overlap between the two studies, themedian FDR and its 95% CI among these common genes(Fig. 2A). The FDR was defined as the fraction of overlappinggenes which would have been expected to be selected bychance alone and was calculated using the hypergeometric distri-bution. We defined the FDR in terms of genes rather than gene-blocks as we found this to be more conservative, due to the factthat the number of possible gene-blocks is much larger than thenumber of possible genes, whereas the two studies almostalways selected the same gene-block within a common genefor pthr . 0.2.

Simulating overlap under non-random gene-blockselection

Given the observed bias in the distribution of the number oftyped SNPs (#SNPs) per selected gene block between selectedand non-selected gene blocks, we recalculated the significanceof the observed overlap assuming non-random gene blockselection according to #SNPs. We first calculated the empiri-cal probability of selecting a gene-block with n SNPs as p(n | selected) � #(gene selected gene blocks with n SNPS).We then smoothed this probability distribution by setting p(n | selected) to be equal to its average in a moving 5-SNPwindow. We then calculated the probability of selectingeach given gene-block g, with #SNPs ¼ #SNPs(g) as

P(g|selected) =p(#SNPs(g)|selected)∗1/#(geneblockswith#SNPs = #SNPs(g))

We can then sample two vectors of selected gene-blocks, oflength N1 and N2 corresponding to the number of NARACand WTCCC selected gene-blocks, and finally calculate thenumber of common gene-blocks selected.

Established RA associations

We considered as previously reported and established RAassociations the genes that were listed in the two recent RAGWAS papers (5,47) a subsequent meta-analysis (6) and the

Catalog of Published Genome-Wide Association Studies(http://www.genome.gov/26525384).

GO functional enrichment analysis

We conducted a GO enrichment analysis using GOrilla (31) onthe genes that our gene-selection methodology identified. Theselected genes were compared with all the typed genes on thechip of each study. We carried out this analysis on each studyseparately and we observed that the same functional categorieswere enriched. We repeated the analysis on the union of theidentified genes. The GO enrichment P-values reported in thetext correspond to the joint analysis; the GO P-values for eachstudy separately are shown in Supplementary Material, TablesS3 and S4. We also carried out a sensitivity analysis on theeffect of different pthr on the enriched GO categories. Theresults of GO enrichment based on pthr¼ 0.3 and pthr ¼ 0.5are shown in Supplementary Material, Tables S5–S7, anddemonstrate the robustness of the conclusions relating to func-tional enrichment relative to pthr. In total, 16, 92 and 191 GOenriched process terms were identified for pthr¼ 0.5, pthr¼0.4 and pthr¼ 0.3, respectively. Fifteen of 16 (94%) enrichedprocess terms at pthr¼ 0.5 where also enriched at pthr¼ 0.4;while 89 of 92 (97%) of those were enriched at pthr¼ 0.3.With regard to the GO functional terms, 6, 17 and 37 GO cat-egories were enriched for pthr¼ 0.5, pthr¼ 0.4 and pthr¼ 0.3,respectively. Six of 6 (100%) enriched functional terms atpthr¼ 0.5 were enriched for pthr¼ 0.4; while 94% (16/17) ofthose were enriched at pthr¼ 0.3.

Web resources

The URLs for data presented herein are as follows: KyotoEncyclopedia of Genes and Genomes (KEGG) Database(http://www.genome.jp/kegg/pathway.html), Biocarta (http://www.biocarta.com/genes/index.asp), GenMAPP (http://www.genmapp.org/), Ingenuity Pathways Analysis (http://www.ingenuity.com/).

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG online.

ACKNOWLEDGEMENTS

This study makes use of data generated by the Wellcome TrustCase-Control Consortium. A full list of the investigators whocontributed to the generation of the data is available fromwww.wtccc.org.uk. We thank Dr Peter K. Gregersen for kindlyproviding us with the raw genotype data of the NARAC study.

Conflict of Interest statement. None declared.

FUNDING

This work was supported by the Imperial College Division ofMedicine PhD Scholarship of H.E., as part of her PhD. It wasalso supported by the European Community’s Seventh Frame-work Programme, grant number 223367 (MultiMod). C.J.H. isfunded by the European Union grant HEALTH-2007-201550HyperGenes.

Human Molecular Genetics, 2011, Vol. 20, No. 17 3505

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018

Page 13: Pathway-driven gene stability selection of two rheumatoid arthritis

REFERENCES

1. Imboden, J.B. (2009) The Immunopathogenesis of Rheumatoid Arthritis.Annu. Rev. Pathol. Mech. Dis., 4, 417–434.

2. Smolen, J.S., Aletaha, D., Koeller, M., Weisman, M.H. and Emery, P.(2007) New therapies for treatment of rheumatoid arthritis. Lancet, 370,1861–1874.

3. McInnes, I.B. and Schett, G. (2007) Cytokines in the pathogenesis ofrheumatoid arthritis. Nat. Rev. Immunol., 7, 429–442.

4. The Wellcome Trust Case Control Consortium (2007) Genome-wideassociation study of 14,000 cases of seven common diseases and 3,000shared controls. Nature, 447, 661–78.

5. Plenge, R.M., Seielstad, M., Padyukov, L., Lee, A.T., Remmers, E.F.,Ding, B., Liew, A., Khalili, H., Chandrasekaran, A., Davies, L.R.L. et al.

(2007) TRAF1-C5 as a risk locus for rheumatoid arthritis - A genomewidestudy. N. Engl. J. Med., 357, 1199–1209.

6. Stahl, E.A., Raychaudhuri, S., Remmers, E.F., Xie, G., Eyre, S., Thomson,B.P., Li, Y.H., Kurreeman, F.A.S., Zhernakova, A., Hinks, A. et al. (2010)Genome-wide association study meta-analysis identifies seven newrheumatoid arthritis risk loci. Nat. Genet., 42, 508–U56.

7. Raychaudhuri, S., Remmers, E.F., Lee, A.T., Hackett, R., Guiducci, C.,Burtt, N.P., Gianniny, L., Korman, B.D., Padyukov, L., Kurreeman,F.A.S. et al. (2008) Common variants at CD40 and other loci confer riskof rheumatoid arthritis. Nat. Genet., 40, 1216–1223.

8. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A.,Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A.et al. (2009) Finding the missing heritability of complex diseases. Nature,461, 747–753.

9. Eleftherohorinou, H., Wright, V., Hoggart, C., Hartikainen, A.-L.,Jarvelin, M.-R., Balding, D., Coin, L. and Levin, M. (2009) Pathwayanalysis of GWAS provides new insights into genetic susceptibility to 3inflammatory diseases. PLoS ONE, 4, e8068.

10. Sabatti, C., Service, S.K., Hartikainen, A.L., Pouta, A., Ripatti, S.,Brodsky, J., Jones, C.G., Zaitlen, N.A., Varilo, T., Kaakinen, M. et al.

(2009) Genome-wide association analysis of metabolic traits in a birthcohort from a founder population. Nat. Genet., 41, 35–46.

11. Peng, G., Luo, L., Siu, H., Zhu, Y., Hu, P., Hong, S., Zhao, J., Zhou, X.,Reveille, J.D., Jin, L. et al. (2009) Gene and pathway-based second-waveanalysis of genome-wide association studies. Eur. J. Hum. Genet., 18,111–117.

12. Chapman, J.M., Cooper, J.D., Todd, J.A. and Clayton, D.G. (2003)Detecting disease associations due to linkage disequilibrium usinghaplotype tags: a class of tests and the determinants of statistical power.Hum. Heredity, 56, 18–31.

13. Chapman, J. and Whittaker, J. (2008) Analysis of multiple SNPs in acandidate gene or region. Genet. Epidemiol., 32, 560–566.

14. Wang, T. and Elston, R.C. (2007) Improved power by use of a weightedscore test for linkage disequilibrium mapping. Am. J. Hum. Genet., 80,353–360.

15. Indranil, M., Eleanor, F., Daniel, E.W. and Anbupalam, T. (2009)Association tests using kernel-based measures of multi-locus genotypesimilarity between individuals. Genet. Epidemiol., 34, 213–221.

16. Wang, K. and Abbott, D. (2008) A principal components regressionapproach to multilocus genetic association studies. Genet. Epidemiol., 32,108–118.

17. Gauderman, W.J., Murcray, C., Gilliland, F. and Conti, D.V. (2007)Testing association between disease and multiple SNPs in a candidategene. Genet. Epidemiol., 31, 383–395.

18. Liu, J.Z., McRae, A.F., Nyholt, D.R., Medland, S.E., Wray, N.R., Brown,K.M., Hayward, N.K., Montgomery, G.W., Visscher, P.M., Martin, N.G.et al. (2010) A versatile gene-based test for genome-wide associationstudies. Am. J. Hum. Genet., 87, 139–145.

19. Jorgenson, E. and Witte, J.S. (2006) A gene-centric approach togenome-wide association studies. Nat. Rev. Genet., 7, 885–891.

20. Li, M.Y., Wang, K., Grant, S.F.A., Hakonarson, H. and Li, C. (2009)ATOM: a powerful gene-based association test by combining optimallyweighted markers. Bioinformatics, 25, 497–503.

21. Hoggart, C.J., Whittaker, J.C., De Iorio, M. and Balding, D.J. (2008)Simultaneous analysis of all SNPs in genome-wide and re-sequencingassociation studies. PLoS Genet., 4, e1000130.

22. Elbers, C. (2009) Using genome-wide pathway analysis to unravel theetiology of complex diseases. Genet. Epidemiol., 33, 419–431.

23. O’Dushlaine, C., Kenny, E., Heron, E.A., Segurado, R., Gill, M., Morris,D.W. and Corvin, A. (2009) The SNP ratio test: pathway analysis ofgenome-wide association datasets. Bioinformatics, 25, 2762–2763.

24. Torkamani, A., Topol, E.J. and Schork, N.J. (2008) Pathway analysis ofseven common diseases assessed by genome-wide association. Genomics,92, 265–272.

25. Cantor, R.M., Lange, K. and Sinsheimer, J.S. (2010) Prioritizing GWASresults: a review of statistical methods and recommendations for theirapplication. Am. J. Hum. Genet., 86, 6–22.

26. Baranzini, S. (2009) Pathway and network-based analysis of genome-wideassociation studies in multiple sclerosis. Hum. Mol. Genet., 18, 2078–2090.

27. Holmans, P., Green, E.K., Pahwa, J.S., Ferreira, M.A.R., Purcell, S.M.,Sklar, P., Owen, M.J., O’Donovan, M.C. and Craddock, N. (2009) Geneontology analysis of GWA study data sets provides insights into thebiology of bipolar disorder. Am. J. Hum. Genet., 85, 13–24.

28. Ballard, D., Abraham, C., Cho, J. and Zhao, H. (2010) Pathway analysiscomparison using Crohn’s disease genome wide association studies. BMC

Med. Genom., 3, 25.29. Meinshausen, N. and Buhlman, P. (2010) Stability selection. J. R. Stat.

Soc., 74, 417–473.30. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R.,

Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W., Daly, M.J. et al.

(2007) PLINK: A tool set for whole-genome association andpopulation-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.

31. Eden, E., Navon, R., Steinfeld, I., Lipson, D. and Yakhini, Z. (2009)GOrilla: a tool for discovery and visualization of enriched GO terms inranked gene lists. BMC Bioinform., 10, 48.

32. Merida, I., Avila-Flores, A. and Merino, E. (2008) Diacylglycerol kinases:at the hub of cell signalling. Biochem. J., 409, 1–18.

33. Serezani, C.H., Ballinger, M.N., Aronoff, D.M. and Peters-Golden, M.(2008) Cyclic AMP - Master regulator of innate immune cell function.Am. J. Respir. Cell Mol. Biol., 39, 127–132.

34. Berridge, M.J., Lipp, P. and Bootman, M.D. (2000) The versatility anduniversality of calcium signalling. Nat. Rev. Mol. Cell Biol., 1, 11–21.

35. Clapham, D.E. (2007) Calcium signaling. Cell, 131, 1047–1058.36. Kholodenko, B.N. (2006) Cell-signalling dynamics in time and space.

Nat. Rev. Mol. Cell Biol., 7, 165–176.37. Rozengurt, E. (2007) Mitogenic signaling pathways induced by G

protein-coupled receptors. J. Cell. Physiol., 213, 589–602.38. Wang, K., Li, M. and Hakonarson, H. (2010) Analysing biological pathways

in genome-wide association studies. Nat. Rev. Genet., 11, 843–854.39. de Crombrugghe, B., Lefebvre, V., Behringer, R.R., Bi, W.M., Murakami,

S. and Huang, W.D. (2000) Transcriptional mechanisms of chondrocytedifferentiation. Matrix Biol., 19, 389–394.

40. Deshmukh, K. and Sawyer, B.D. (1977) Synthesis of collagen bychondrocytes in suspension culture - modulation by calcium, 3′-5′-cyclicamp, and prostaglandins - (Arthritis Protein Synthesis Invitro). Proc. Natl

Acad. Sci. USA, 74, 3864–3868.41. Allen, M.E., Young, S.P., Michell, R.H. and Bacon, P.A. (1995) Altered

T-Lymphocyte Signaling in Rheumatoid-Arthritis. Eur. J. Immunol., 25,1547–1554.

42. Souness, J.E., Aldous, D. and Sargent, C. (2000) Immunosuppressive andanti-inflammatory effects of cyclic AMP phosphodiesterase (PDE) type 4inhibitors. Immunopharmacology, 47, 127–162.

43. Zha, Y.Y., Marks, R., Ho, A.W., Peterson, A.C., Janardhan, S., Brown, I.,Praveen, K., Stang, S., Stone, J.C. and Gajewski, T.F. (2006) T cell anergyis reversed by active Ras and is regulated by diacylglycerol kinase-alpha.Nat. Immunol., 7, 1166–1173.

44. Chen, L.S., Hutter, C.M., Potter, J.D., Liu, Y., Prentice, R.L., Peters, U.and Hsu, L. (2010) Insights into colon cancer etiology via a regularizedapproach to gene set analysis of GWAS data. Am. J. Hum. Gene., 86,860–871.

45. Veyrieras, J.B., Kudaravalli, S., Kim, S.Y., Dermitzakis, E.T., Gilad, Y.,Stephens, M. and Pritchard, J.K. (2008) High-resolution mapping ofexpression-QTLs yields insight into human gene regulation. PLoS Genet.,4, 15.

46. McVean, G.A.T., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. andDonnelly, P. (2004) The fine-scale structure of recombination ratevariation in the human genome. Science, 304, 581–584.

47. WTCCC (2007) Genome-wide association study of 14,000 cases of sevencommon diseases and 3,000 shared controls. Nature, 447, 661–678.

3506 Human Molecular Genetics, 2011, Vol. 20, No. 17

Downloaded from https://academic.oup.com/hmg/article-abstract/20/17/3494/2527070by gueston 27 March 2018