
Upload: thermo-fisher-scientific

Post on 14-Apr-2017


Information Genetic Content (IGC): a comprehensive discovery platform for disease-gene research association

Yun Zhu, Emily Williams, Yuan Tian, Carol Munroe, John Bucci, Yutao Fu, Fiona Hyland, and Corina Shtir, Clinical Next-Gen Seq Division, Thermo Fisher Scientific Inc., 5781 Van Allen Way, Carlsbad, CA, U.S.A, 92008.

Table 1. Disease annotation for the 28 identified gene clusters.

ABSTRACT

We developed Information Genetic Content (IGC), a comprehensive knowledgebase and discovery tool for research use in the study of human genes and genetic disorders. IGC comprises three components: the Disease-Association Database (DAD), the Gene Scoring Algorithm (GSA), and the Virtual Panel Library (VPL). The DAD module contains over 400,000 associations between over 17,000 genes and 15,000 Mendelian and complex diseases, drawn from both expert-curated and text-mined data. The DAD module also organizes human diseases hierarchically using a UMLS-controlled vocabulary, permitting queries at any level of the disease ontology hierarchy. The GSA module prioritizes genes for a specific disease of interest. This gene scoring algorithm is distinctive in that it combines the strength of association and the number of associated diseases to provide an unbiased score for each gene. In conjunction with the DAD module, the GSA module can produce a ranked list of genes for one or more diseases at any level of the disease hierarchy. The VPL module groups genes by disease classification using hierarchical-clustering-based network analysis. Genes that are involved in the same pathological pathways are grouped into the same cluster.

INTRODUCTION

The identification of disease-associated genes is an important step towards understanding disease mechanisms and towards future diagnosis and therapy. However, current scientific knowledge is spread across several overlapping databases maintained by independent groups, and this dispersal makes it unclear how to rank gene-disease research associations. To fill this gap, we developed Information Genetic Content (IGC), a comprehensive knowledgebase and discovery tool for research use in the study of human genes and genetic disorders. IGC is unique in two respects. First, it integrates data from multiple databases into one system. Second, it provides an unbiased scoring algorithm to rank gene-disease research associations at any level of the disease ontology hierarchy.

METHODS

CONCLUSIONS

We created IGC, a comprehensive, efficient, and informative engine for optimizing gene selection for diseases at any level of the disease ontology hierarchy:

• The DAD organizes diseases into an effective hierarchical structure for lookup and associates diseases with genes.

• The GSA ranks genes by clinical relevance and summarizes the scores for diseases at any level of the hierarchy.

• The VPL efficiently groups genes into pools by disease classification and further ranks the genes within clusters by their relative importance to diseases.

REFERENCES

1. Piñero J, Queralt-Rosinach N, Bravo À, et al. (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015:bav028.

2. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(Database issue):D267-70.

3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559.

For Research Use Only. Not for use in diagnostic procedures.

© 2016 Thermo Fisher Scientific Inc. All rights reserved. All trademarks are the property of Thermo Fisher Scientific and its subsidiaries unless otherwise specified.


Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com

Figure 2. Disease-Association Database (DAD) maps genes to diseases. Blue circles: two neurological diseases, schizophrenia and bipolar disorder. Green circles: genes associated with these two diseases.

• DAD contains over 400,000 associations between over 17,000 genes and 15,000 Mendelian and complex diseases from both expert-curated and text-mined data.

• DAD establishes gene-disease relationships based on DisGeNET [1], which scores gene-disease associations according to expert-curated sources (e.g. CTD, CLINVAR, and ORPHANET), predicted data from mouse models, and text mining of publications.

• DAD organizes diseases into an effective hierarchical structure for lookup, using the disease parent-child relationships established in the NIH Unified Medical Language System (UMLS).

• For any disease in the hierarchical tree, the GSA computes the rank-weighted sum score (RWSS) to summarize the strength of each gene’s association with all of that disease’s child diseases (see below).
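The hierarchical lookup described in the bullets above can be sketched as follows. This is a minimal illustration, not the DAD implementation: the parent-child links, the gene names (GENE_A and so on), the scores, and the function names are all hypothetical, and a real system would load the hierarchy from UMLS and the associations from DisGeNET.

```python
from collections import defaultdict

# Illustrative parent -> child disease links (assumed, not taken from UMLS).
PARENT_CHILD = [
    ("neurological disease", "schizophrenia"),
    ("neurological disease", "bipolar disorder"),
    ("bipolar disorder", "bipolar I disorder"),
]

# Illustrative (gene, disease, association score) rows; gene names are placeholders.
ASSOCIATIONS = [
    ("GENE_A", "schizophrenia", 0.8),
    ("GENE_A", "bipolar I disorder", 0.4),
    ("GENE_B", "bipolar disorder", 0.6),
    ("GENE_C", "asthma", 0.9),
]


def build_children(edges):
    """Index the hierarchy as parent -> set of direct children."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
    return children


def descendants(disease, children):
    """Return the disease itself plus every disease below it in the hierarchy."""
    seen = set()
    stack = [disease]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return seen


def associations_under(disease, edges, associations):
    """Collect gene -> list of scores for a disease and all of its descendants."""
    scope = descendants(disease, build_children(edges))
    by_gene = defaultdict(list)
    for gene, assoc_disease, score in associations:
        if assoc_disease in scope:
            by_gene[gene].append(score)
    return dict(by_gene)
```

Querying a high-level node such as "neurological disease" gathers the associations of every disease beneath it, while querying "bipolar disorder" restricts the result to that subtree. The per-gene score lists returned here are the kind of input a scoring step would then summarize.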

Figure 3. Gene Scoring Algorithm (GSA)

Figure 5. Gene clustering identified 28 VPLs that can be well defined by disease classifications.

Disease Key

MeSH Category   Description
C04             Neoplasms
C05             Musculoskeletal Diseases
C06             Digestive System Diseases
C07             Stomatognathic Diseases
C08             Respiratory Tract Diseases
C09             Otorhinolaryngologic Diseases
C10             Nervous System Diseases
C11             Eye Diseases
C12             Male Urogenital Diseases
C13             Female Urogenital Diseases and Pregnancy Complications
C14             Cardiovascular Diseases
C15             Hemic and Lymphatic Diseases
C16             Congenital, Hereditary, and Neonatal Diseases and Abnormalities
C17             Skin and Connective Tissue Diseases
C18             Nutritional and Metabolic Diseases
C19             Endocrine System Diseases
C20             Immune System Diseases


Rank-Weighted Sum Score (RWSS)

RWSS is an unbiased gene scoring method that accounts for both the strength and the number of a gene's disease associations.
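The poster does not state the RWSS formula, so the following is only one plausible reading, offered as a sketch: a gene's association scores are sorted from strongest to weakest and the k-th score is weighted by 1/k before summing. Under this assumed weighting, strong associations dominate, and additional associations still raise the score but with diminishing returns. The function names and the example scores are hypothetical.

```python
def rwss(scores):
    """Rank-weighted sum under an ASSUMED 1/rank weighting.

    The strongest score gets weight 1, the second 1/2, the third 1/3, and so on.
    """
    ranked = sorted(scores, reverse=True)
    return sum(score / rank for rank, score in enumerate(ranked, start=1))


def rank_genes(scores_by_gene):
    """Return (gene, RWSS) pairs sorted from highest to lowest RWSS.

    Ties are broken alphabetically by gene name so the order is deterministic.
    """
    scored = [(gene, rwss(scores)) for gene, scores in scores_by_gene.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical input: per-gene association scores with the diseases in scope.
EXAMPLE_SCORES = {
    "GENE_A": [0.9],            # one strong association
    "GENE_B": [0.1] * 10,       # many weak associations
    "GENE_C": [0.9, 0.5],       # one strong plus one moderate association
}
```

With these example scores, rank_genes(EXAMPLE_SCORES) places GENE_C first, GENE_A second, and GENE_B last. A plain sum would instead put GENE_B (total 1.0) ahead of GENE_A (0.9), which is the kind of bias towards heavily annotated genes that a rank-weighted scheme is meant to dampen.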

From the top 5,000 genes ranked as clinically relevant by the GSA, 28 gene clusters were identified using the WGCNA algorithm [3]. A) Hierarchical clustering of genes according to their association patterns with 16 high-level MeSH categories relevant to inherited diseases. B) Gene cluster association scores with the 16 MeSH disease categories, shown with p-values.
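The clustering idea in panel A can be illustrated with a small stand-in. WGCNA is an R package with its own network construction and tree-cutting steps, none of which are reproduced here. The sketch below swaps in plain average-linkage agglomerative clustering with a correlation-based distance (1 minus Pearson r), written in pure Python. The gene names, the three-category profiles, and the max_distance cutoff are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_genes(profiles, max_distance=0.5):
    """Average-linkage agglomerative clustering with distance 1 - Pearson r.

    Merging stops once the two closest clusters are farther apart than
    max_distance. Returns a sorted list of sorted gene-name lists.
    """
    genes = sorted(profiles)
    dist = {
        frozenset((g, h)): 1.0 - pearson(profiles[g], profiles[h])
        for g, h in combinations(genes, 2)
    }
    clusters = [[g] for g in genes]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical gene profiles: association scores with three disease categories
# (nervous system, cardiovascular, eye), in that order.
PROFILES = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.1, 0.9, 0.0],
    "GENE_D": [0.0, 0.7, 0.1],
}
```

Calling cluster_genes(PROFILES) groups GENE_A with GENE_B and GENE_C with GENE_D, because each pair shares an association pattern across the categories. Each resulting cluster could then be annotated by the category in which its members score highest, which is the role the disease annotations play for the 28 clusters reported here.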

RESULTS

Figure 1. Overview of IGC framework

Figure 4. Gene scoring in multiple disease hierarchies (hierarchy levels 1 to 4 shown).

• The GSA module uses the RWSS method to prioritize genes for a specific disease of interest.

• In conjunction with the DAD module, the GSA module can produce a ranked list of genes for one or more diseases at any level of the disease hierarchy.

Module #   Module Color    Gene Count   Disease Annotation
1          turquoise       530          Nervous System Diseases
2          blue            321          Nutritional and Metabolic Diseases
3          brown           307          Cardiovascular Diseases
4          yellow          280          Digestive System Diseases
5          green           253          Eye Diseases
6          red             250          Skin and Connective Tissue Diseases
7          black           229          Male and Female Urogenital Diseases
8          pink            205          Musculoskeletal Diseases
9          magenta         164          Nervous System Diseases; Nutritional and Metabolic Diseases
10         purple          150          Hemic and Lymphatic Diseases
11         greenyellow     140          Musculoskeletal Diseases; Nervous System Diseases
12         tan             137          Neoplasms
13         salmon          129          Respiratory Tract Diseases
14         cyan            111          Otorhinolaryngologic Diseases; Nervous System Diseases
15         midnightblue    90           Male Urogenital Diseases
16         lightcyan       87           Immune System Diseases; Male Urogenital Diseases; Female Urogenital Diseases and Pregnancy Complications
17         grey60          76           Stomatognathic Diseases
18         lightgreen      69           Hemic and Lymphatic Diseases; Immune System Diseases
19         lightyellow     67           Female Urogenital Diseases and Pregnancy Complications; Endocrine System Diseases
20         royalblue       63           Female Urogenital Diseases and Pregnancy Complications
21         darkred         61           Musculoskeletal Diseases; Skin and Connective Tissue Diseases
22         darkgreen       60           Musculoskeletal Diseases; Stomatognathic Diseases
23         darkgrey        55           Female and Male Urogenital Diseases; Nutritional and Metabolic Diseases
24         darkturquoise   55           Nutritional and Metabolic Diseases; Endocrine System Diseases
25         darkorange      36           Musculoskeletal Diseases; Cardiovascular Diseases
26         orange          36           Immune System Diseases
27         white           35           Endocrine System Diseases
28         skyblue         34           Immune System Diseases; Skin and Connective Tissue Diseases