multiple phewas view - github pagesaneuraz.github.io/multiphewasview/process_book.pdf · the...

Antoine Neuraz – cs171 – 05/05/2015

Multiple PheWAS view

• Overview and motivation.

The field of high-dimensional biomedical data visualization is constantly expanding. We focus on a specific type of large-scale association study: the phenome-wide association study (PheWAS) (1,2). This method is derived from the genome-wide association study (GWAS) (Figure 1) and works in the inverse direction: it scans all available phenotypic data to find systematic associations with a specific genomic status. A PheWAS investigates whether the SNPs associated with a phenotype are also associated with other diagnoses (Figure 1, Panel B).

FIGURE 1: COMPARISON BETWEEN GWAS AND PHEWAS

[Panel A: Genome-Wide Association Study (1 phenotype compared to ALL SNPs). The DNA of cases (e.g. systemic sclerosis) and controls is compared across ALL SNPs to find differences between cases and controls. Manhattan plot: -log(P-value) by chromosome, with the HLA region (chromosome 6) labeled.
Panel B: Phenome-Wide Association Study (1 SNP compared to ALL phenotypes). The phenotypes of the allele G patients group and the allele A patients group are compared across ALL diagnoses to find differences between cases and controls. Manhattan plot: -log(P-value) by ICD-10 code, with I21 (myocardial infarction) labeled.]

For a selected single nucleotide polymorphism (SNP, a common variation of one nucleotide in the genome), two groups are formed: one with a specific allele and a control group with the other alleles. To search for new associations, all of the phenotypic data (for example, all International Classification of Diseases (ICD) codes) available in the medical records of the patients carrying the specific allele are screened and compared to those of the control group. The standard way of displaying the results is a Manhattan plot. In a PheWAS, the Manhattan plot is constructed as follows: each data point is a phenotype (e.g. a disease such as myocardial infarction); the horizontal axis holds the disease categories and the vertical axis the degree of association between the disease and the initial selection criterion (e.g. a SNP). This visualization is very efficient at showing the results of a single PheWAS, but it does not scale well: the results of multiple PheWAS cannot be reviewed at the same time.

• Related work.

PheWAS results are usually displayed as Manhattan plots, as in (1,2). Each point represents a phenotype, positioned horizontally according to its category (e.g. cardiovascular, pulmonary). The vertical axis encodes the strength of the association between the phenotype and the SNP analyzed: the higher the point, the more significant the association. Pendergrass et al. (3) proposed enhancements to the classic Manhattan plot, but they relied on static approaches only, which limited the possibilities. We would therefore like to develop an interactive tool for visualizing the results of multiple PheWAS at once.

• Questions.

Manhattan plots for visualizing PheWAS results have limitations:

• It is not possible to see the results of more than one PheWAS analysis on the same figure. One Manhattan plot corresponds to the analysis of one SNP against all the diagnoses, and showing the results of several SNPs on the same plot is not practical.

• This type of visualization does not display the effect size of the association. The strength of the association (how confident we are that it exists, i.e. the p-value) is distinct from the effect size (how much the SNP changes the odds of the disease, i.e. the odds ratio). An association can be strong while the effect of the SNP on the disease is small, or vice versa. In other words, we can be very confident that a SNP has an effect on a specific disease while knowing that this effect is small.
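The difference between strength of association and effect size can be illustrated with a small worked example. The sketch below is in Python (the project itself used R) and applies a standard Wald test to the log odds ratio of a 2x2 table; the function name and the patient counts are made up for illustration and do not come from the PheWAS catalog.

```python
import math
from statistics import NormalDist


def odds_ratio_and_p(a, b, c, d):
    """Odds ratio and two-sided Wald p-value for a 2x2 table.

    a, b: carriers of the allele with / without the disease
    c, d: non-carriers with / without the disease
    (All four counts must be greater than zero.)
    """
    log_or = math.log((a * d) / (b * c))
    # Standard error of the log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = abs(log_or) / se
    # Two-sided p-value from the lower tail of the standard normal
    p = 2 * NormalDist().cdf(-z)
    return math.exp(log_or), p


# Hypothetical large cohort: small effect, very strong association
small_effect = odds_ratio_and_p(11000, 89000, 10000, 90000)

# Hypothetical small cohort: large effect, weak association
large_effect = odds_ratio_and_p(6, 14, 3, 17)
```

With these invented counts, the first table gives an odds ratio close to 1.1 with a very small p-value, while the second gives an odds ratio above 2 that is not significant at the 5% level. A Manhattan plot alone would only show the p-values.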

This project aims to create a tool that displays multiple PheWAS (e.g. multiple SNPs) at the same time, showing the effect size along with the strength of association. The goal is to help researchers form new hypotheses based, for example, on the similarities between the effects of different SNPs on the same phenotype, or on the co-association of different phenotypes with the same SNP.

• Data.

- Source

For demonstration purposes, we used publicly available data from the PheWAS catalog (4), which gathers the results of multiple PheWAS. It contains PheWAS results for 3,144 SNPs, 13,835 patients and a total of 1,358 phenotypes.

- Method.

We selected the top 35 SNPs (i.e. the SNPs with the highest number of significant associations). We computed the dissimilarity between the features using the binomial distance, then performed a hierarchical clustering to reorder the SNPs and phenotypes. The odds ratios were not reported with their confidence intervals, so we estimated the confidence intervals from the p-values under the assumption of a Gaussian distribution (5). For the network view, a co-association was defined as follows: two phenotypes were linked if they had a significant association (after correction for multiple testing) with the same SNP. The number of SNPs shared in this way defined the weight of the link.
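The confidence-interval estimate described above can be sketched as follows. This is a Python illustration of the general idea in (5), not the R code used for the project: the p-value is converted back to a test statistic, which gives a standard error on the log odds-ratio scale. The function name and the 95% default are assumptions, and the exact normal quantile is used here where (5) gives a closed-form approximation.

```python
import math
from statistics import NormalDist


def ci_from_p_value(odds_ratio, p_value, level=0.95):
    """Approximate confidence interval for an odds ratio from its two-sided p-value.

    Assumes the log odds ratio is normally distributed. Not defined when
    odds_ratio == 1 (the standard error would be zero).
    """
    if not 0 < p_value < 1:
        raise ValueError("p_value must be strictly between 0 and 1")
    nd = NormalDist()
    # Test statistic implied by the two-sided p-value
    z = -nd.inv_cdf(p_value / 2)
    log_or = math.log(odds_ratio)
    # Standard error on the log scale: |estimate| / z
    se = abs(log_or) / z
    # Critical value for the requested level (about 1.96 for 95%)
    z_crit = -nd.inv_cdf((1 - level) / 2)
    return math.exp(log_or - z_crit * se), math.exp(log_or + z_crit * se)


lower, upper = ci_from_p_value(2.0, 0.05)
```

A quick sanity check: with a p-value of exactly 0.05, the 95% interval touches 1 on one side, so an odds ratio of 2 yields an interval of roughly 1 to 4.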

- Software

Data pre-processing was performed using R statistical software version 3.1.2. The following packages were used:

• vegan_2.2-1 (vegdist for the binomial distance)
• reshape2_1.4.1 (data wrangling)
• PheWAS_0.9.6 (for the phenotype categories)
• rjson_0.2.15 (for the export to JSON)

• Design evolution.

- Initial design

The initial design was a 4-view dashboard (Figure 2) with:
- a heatmap to display the effect size for every SNP and phenotype,
- a forest plot to show the details of the effect sizes for a given phenotype across all SNPs,
- a Manhattan plot to show the association strength for all the phenotypes on a given SNP,
- a details view to show the details of a single phenotype association on a single SNP.

[Mock-up labels: heatmap with rows Phenotype 1 to Phenotype 13 and columns PheWAS 1 to PheWAS 10, a color legend from 0 to 4, and the selected items PheWAS 2 and Phenotype 6.]

FIGURE 2: INITIAL DASHBOARD DESIGN


In the heatmap, the color hue (a red/blue scale, suited to color-blind readers) encodes whether the mutation increases or decreases the odds of having the disease, and the color saturation encodes the magnitude of this effect. In the forest plot, the value of the effect is double-encoded by the vertical position and the color hue. In the Manhattan plot, the horizontal position encodes the strength of association (-log(p-value)), and the category of the phenotype is encoded by the vertical position and the color of the data point.

Interactivity of the initial design: a click on a row of the heatmap changes the forest plot, a click on a column changes the Manhattan plot, and a click on a cell changes both, plus the details view. Figure 3 shows the implementation of the initial design (without the details view).

FIGURE 3: INITIAL DESIGN IMPLEMENTATION
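The visual encodings of the heatmap and the Manhattan plot can be summarized in a few lines. The sketch below is a Python illustration, not the code of the tool: the function names, the choice of red for increased odds and blue for decreased odds, the saturation cap at an odds ratio of 4 (or 1/4), and the base-10 logarithm are all assumptions made for the example.

```python
import math


def effect_color(odds_ratio, max_abs_log_or=math.log(4)):
    """Return (hue, saturation) for a heatmap cell.

    Hue encodes the direction of the effect, saturation its magnitude
    on the log odds-ratio scale, capped at 1.0.
    """
    log_or = math.log(odds_ratio)
    hue = "red" if log_or > 0 else "blue"  # assumed direction of the scale
    saturation = min(abs(log_or) / max_abs_log_or, 1.0)
    return hue, saturation


def association_strength(p_value):
    """Manhattan plot coordinate: -log10 of the p-value (base 10 assumed)."""
    return -math.log10(p_value)


increased = effect_color(2.0)    # odds doubled
decreased = effect_color(0.5)    # odds halved
saturated = effect_color(100.0)  # beyond the cap
strength = association_strength(0.001)
```

Working on the log scale makes the encoding symmetric: an odds ratio of 2 and an odds ratio of 0.5 receive the same saturation with opposite hues.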


- Intermediate design

FIGURE 4: INTERMEDIATE DESIGN IMPLEMENTATION

The main change compared to the initial design is the disappearance of the forest plot (Figure 4). It did not really disappear: it is now embedded into the heatmap and appears when the user clicks on a cell or a row name of the heatmap, or on a circle of the Manhattan plot (Figure 5).

FIGURE 5: INLINE FOREST PLOT

The color scale of the heatmap also changed: when no data are available for a given cell, the cell now appears in white. We also added a legend for the color scale. The Manhattan plot evolved slightly as well: the background color for the phenotype categories is lighter, and the circles lost their stroke (when not selected) and are now transparent, to better show the density of circles at a given position. The details view was added, with an attempt at showing the odds ratio using pictograms of people. For example, for an odds ratio of 2, there were twice as many dark pictograms as light ones.


- Final design

We replaced the details view with a tooltip that appears whenever the mouse is over a cell in the heatmap or a circle in the Manhattan plot (Figure 6). This tooltip contains all the information from the details view, except the pictographic odds ratio, which we decided to remove.

FIGURE 6: TOOLTIP

We also decided to add a new view to the project. The idea was to represent the links between different phenotypes, or between different SNPs, depending on the associations they share. We included a co-association network in which each node is a phenotype (resp. a SNP) and each link represents an association with the same SNP (resp. the same phenotype) (Figure 7). The weight of a link depends on the number of shared associations. The color of a node depends on the category of the phenotype or on the gene related to the SNP.


FIGURE 7: FINAL DESIGN IMPLEMENTATION
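The construction of the co-association network (phenotype projection) can be sketched as follows. This is a Python illustration of the definition given in the text, not the R pre-processing script: the function name, the input format and the SNP and phenotype names are hypothetical, and the significance threshold is passed in as a parameter because the correction method is not specified here.

```python
from collections import defaultdict
from itertools import combinations


def co_association_links(associations, alpha):
    """Build weighted links between phenotypes sharing a significant SNP.

    associations: iterable of (snp, phenotype, p_value) tuples
    alpha: significance threshold, already corrected for multiple testing
    Returns {(phenotype_a, phenotype_b): number of shared SNPs}.
    """
    phenotypes_by_snp = defaultdict(set)
    for snp, phenotype, p_value in associations:
        if p_value < alpha:
            phenotypes_by_snp[snp].add(phenotype)

    weights = defaultdict(int)
    for phenotypes in phenotypes_by_snp.values():
        # Each pair of phenotypes significant for this SNP gains one unit of weight
        for a, b in combinations(sorted(phenotypes), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical results: (SNP, phenotype, p-value)
example = [
    ("rs1", "A", 1e-6), ("rs1", "B", 1e-5), ("rs1", "C", 1e-7),
    ("rs2", "A", 1e-6), ("rs2", "B", 1e-8),
    ("rs3", "A", 1e-6), ("rs3", "C", 0.2),
]
links = co_association_links(example, alpha=1e-4)
```

The SNP projection is obtained the same way by swapping the roles of SNPs and phenotypes in the input tuples.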


• Implementation.

The tool is organized in two screens. The first screen holds the effect size map (heatmap) and the association strength map (Manhattan plot); the second screen shows the co-association network. To navigate between the two screens, the user can either click on one of the floating View buttons (Figure 8) or scroll down directly to the second screen (if the user chooses to scroll, the buttons are updated automatically).

FIGURE 8: VIEW BUTTONS

To help the user understand the different views, a tooltip appears when a view title is hovered (Figure 9).

FIGURE 9: TITLE TOOLTIP

- Screen 1 (heatmap and Manhattan plot)

The heatmap holds most of the interactivity on the first screen. When the user clicks on a row, the corresponding forest plot appears inline with an unfolding transition (Figure 5). When the user clicks on a new row, the previous row is closed and the new one is opened. The heatmap legend and control buttons move as the forest plots open and close.

The rows of the heatmap can be reordered alphabetically or by hierarchical clustering (the hierarchical clustering is pre-computed in the R script).

When a new column is selected, the Manhattan plot is updated to the new SNP. In the Manhattan plot, a click on a circle opens (or closes) the corresponding forest plot in the heatmap. If the clicked circle corresponds to a phenotype not represented in the heatmap (only the phenotypes with at least one significant association are displayed in the heatmap), the corresponding forest plot appears above all the rows (Figure 10).

FIGURE 10: INLINE FOREST PLOT FOR ANOTHER PHENOTYPE

Hovering over a data point in the heatmap or in the Manhattan plot displays a tooltip with the details of the corresponding phenotype and SNP (Figure 6). The position of the tooltip is adjusted dynamically, depending on its horizontal position, to avoid an out-of-bounds display (Figure 11).

FIGURE 11: ADJUSTMENT OF THE TOOLTIP POSITION
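One simple way to keep a tooltip inside the view is to flip it to the other side of the pointer when it would overflow. The sketch below expresses that logic in Python for illustration only; the function name, the parameters and the flipping strategy are assumptions, and the tool may adjust the position differently.

```python
def tooltip_left(pointer_x, tooltip_width, container_width, offset=10):
    """Left edge of the tooltip, in pixels from the left of the container.

    The tooltip is placed to the right of the pointer by default and flipped
    to the left when it would cross the right edge of the container.
    """
    left = pointer_x + offset
    if left + tooltip_width > container_width:
        left = pointer_x - offset - tooltip_width
    # Never start before the left edge of the container
    return max(left, 0)


fits_right = tooltip_left(100, 200, 1000)  # enough room on the right
flipped = tooltip_left(900, 200, 1000)     # would overflow, so flipped left
clamped = tooltip_left(50, 200, 220)       # container too narrow, clamped to 0
```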

Hovering also highlights the corresponding data point in the other view (Figure 12).


FIGURE 12: DATA POINT HIGHLIGHTING

- Network (Figure 13)

FIGURE 13: CO-ASSOCIATION NETWORK


The user can choose between the two projections (phenotypes or SNPs) of the co-association network by clicking on the corresponding icon. When a node is hovered, it is highlighted along with the nodes linked to it.

FIGURE 14: NODE HIGHLIGHTING

The user can adjust the position of a node by dragging it to a position of their choice. A stroke appears once the node has been pinned by the user. A double click releases the node back into the force-directed layout (Figure 15).

FIGURE 15: PINNED NODES


• Evaluation

This visualization helps to explore the interactions between the different phenotypes and SNPs. It is easy to find clusters of SNPs and phenotypes and to assess the strength of these associations. For example, we can see a group of cardiovascular diseases clustered together with a group of SNPs that decrease their odds. In the network view, we can see that this group of cardiovascular diseases is linked to a neurological phenotype (Neurological disorders due to brain damage), which makes sense given that this kind of disorder is usually due to a stroke caused by cardiovascular problems.

But this visualization has limitations. The main one is probably the limited number of SNPs that can be displayed on the heatmap: it will be very difficult to go beyond 100 SNPs, whereas the PheWAS catalog alone contains more than 3,000 SNPs.

The next step would be to allow users to upload their own dataset and explore it. To do this, we will need to integrate R with D3. Some tools exist that allow R to act as an API and communicate with JavaScript this way.


References

1. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010 May 1;26:1205–10.

2. Neuraz A, Chouchana L, Malamut G, Le Beller C, Roche D, Beaune P, et al. Phenome-wide association studies on a quantitative trait: application to TPMT enzyme activity and thiopurine therapy in pharmacogenomics. PLoS Comput Biol. 2013 Dec;9(12):e1003405.

3. Pendergrass SA, Brown-Gentry K, Dudek SM, Torstenson ES, Ambite JL, Avery CL, et al. The use of phenome-wide association studies (PheWAS) for exploration of novel genotype-phenotype relationships and pleiotropy discovery. Genet Epidemiol. 2011 Jul;35:410–22.

4. PheWAS Catalog [Internet]. [cited 2015 Apr 3]. Available from: http://phewas.mc.vanderbilt.edu/

5. Altman DG, Bland JM. How to obtain the confidence interval from a P value. BMJ. 2011 Aug 8;343:d2090.
