Knowledge Based Discovery: Through Text Mining and Graph Theory
UNC Charlotte Department of Bioinformatics and Genomics
Introduction
Methods
Results
Discussion
[Workflow diagram. Components: Ontologies (MeSH, ChEBI, NALT, NCBI, Entrez Gene); Agricola; Text Mining Software (NLP, ontology based, query development, co-occurrence); JSON (text mining results); Extract Data; CSV (intermediate files); Load Data; Graph Database]
References
Christina Stylianou, Bishop Duhon, Walter Clements
Society has become more aware of the connection between diet and human health.¹ Diagnoses of diseases linked to lifestyle choices, such as type 2 diabetes, are increasing at an alarming rate. The number of Americans diagnosed with diabetes increased from 5.6 million in 1980 to 20.9 million in 2011.² An extensive amount of published literature on these diseases is indexed in PubMed, a database of scientific literature. However, PubMed comprises over 24 million citations for scientific articles, with a new article added every minute.³ We aim to make full use of this literature through text mining, a method of literature-based discovery. We used Linguamatics I2E, a natural language processing (NLP) based text mining platform, to extract information. We are able to extract explicit relationships that describe the interactions between phytochemicals and genes at the molecular level. Ultimately, our work generates a graph database that can be queried to further investigate the effects of diet on human health.
1. Jensen, K., Panagiotou, G., & Kouskoumvekaki, I. (2014). Integrated Text Mining and Chemoinformatics Analysis Associates Diet to Health Benefit at Molecular Level. PLoS Computational Biology, 10(1). http://doi.org/10.1371/journal.pcbi.1003432
2. Number (in Millions) of Civilian, Noninstitutionalized Persons with Diagnosed Diabetes, United States, 1980-2011. (2014, May). Retrieved July 22, 2015, from http://www.cdc.gov/diabetes/statistics/prev/national/figpersons.htm
3. Using PubMed. Retrieved July 22, 2015, from http://www.ncbi.nlm.nih.gov/pubmed/
4. Brand, E., & Sandberg, M. (1926). The Lability of the Sulfur in Cystine Derivatives and its Possible Bearing on the Constitution of Insulin.
5. Inflammation Is Necessary for Long-Term but Not Short-Term High-Fat Diet–Induced Insulin Resistance. (2011). Diabetes.
6. Laville M, N. J. (2009). Diabetes, insulin resistance and sugars. Diabetes.
7. Holecek, M. (2015). Ammonia and amino acid profiles in liver cirrhosis: effects of variables leading to hepatic encephalopathy. Nutrition.
8. Xu, C. (2015). High expression of NQO1 is associated with poor prognosis in serous ovarian carcinoma. BMC Cancer.
Graph Statistics
Number of Nodes: 36,743
Number of Relationships: 979,656
Number of Properties: 2,975,438
Diameter: 4
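As a rough illustration of how statistics like these can be derived, the sketch below counts nodes and relationships and computes the diameter of a small graph. It assumes the diameter is the longest shortest path between any two nodes and that the graph is connected and undirected. The toy edges and the function names (`adjacency_from`, `eccentricity`, `graph_statistics`) are hypothetical and are not taken from the poster's database.

```python
from collections import deque

# Hypothetical toy graph; the real database is far larger.
EDGES = [
    ("plant", "chemical"),
    ("chemical", "gene"),
    ("gene", "pathway"),
    ("pathway", "disease"),
    ("chemical", "pathway"),
]

def adjacency_from(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def eccentricity(adjacency, start):
    """Greatest shortest-path distance from start, found by breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())

def graph_statistics(edges):
    """Node count, relationship count, and diameter (assumes a connected graph)."""
    adjacency = adjacency_from(edges)
    return {
        "nodes": len(adjacency),
        "relationships": len(edges),
        "diameter": max(eccentricity(adjacency, node) for node in adjacency),
    }

stats = graph_statistics(EDGES)
print(stats)
```

Running one breadth-first search per node is practical at the scale of tens of thousands of nodes, though a graph database or graph library would normally provide this calculation directly.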
Entity Statistics
Plants: 9,920
Chemicals: 12,557
Genes: 11,331
Pathways: 2,631
Disease: 304
Initial queries on the graph database for plant and disease relationships show how the two are interrelated. One query revealed a path linking diabetes to broccoli (Fig. 3). Broccoli contains sulfur, which is beneficial for the production of insulin in the human body. Sulfur is a component of the insulin protein, which is responsible for glucose uptake from the bloodstream.⁴ The RAG1 gene is known to aid in the control and production of insulin.⁵ Furthermore, insulin plays a key part in the development of diabetes mellitus type 2.⁶ Our query found a relationship between the sulfur atom and the RAG1 gene, both of which are known to influence insulin levels in the body.
Hepatic encephalopathy is a brain disorder associated with liver failure and with high ammonia concentrations in the liver produced by the breakdown of glutamine. It is linked to the detoxification pathway because of the liver's function in the body. An integral gene in the detoxification pathway is NQO1. NQO1, a quinone oxidoreductase, is a flavoprotein responsible for the removal of radicals and the detoxification of quinones. High ammonia levels result in oxidative stress. Vaccinium corymbosum (blueberry) is an active producer of polyphenolic compounds that act as antioxidants. The NQO1 gene and its antioxidant enzyme have been found to be activated by polyphenols. How polyphenols affect disease outcome remains uncertain.
The problems we faced included having to curate results manually in order to refine our patterns and resolve duplicate information and false positives. In addition, gathering the proper ontologies and putting them in the correct format for the text mining software was tedious and time-consuming. Our research generates a graph-based system mapping the connections between dietary phytochemicals and genes associated with nutritional and metabolic diseases. Through continued research and refined querying, we can improve the results to better elucidate the molecular pathways involved. Future research goals include widening the scope of diseases beyond nutritional and metabolic disorders.
Table 1: Statistics describing the graph database
Table 2: Statistics describing node types in the graph
To explore the pathways by which foods provide benefits to human health, we search scientific literature by developing patterns to find relationships linking disease to diet. The process is described in the workflow diagram above.
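The extract step of this workflow (JSON text mining results to CSV intermediate files) might look roughly like the sketch below. The JSON layout, the field names (`source`, `source_type`, `target`, `target_type`, `relation`), the relation labels, and the sample records are all assumptions made for illustration. The actual I2E export format and CSV layout are not described here.

```python
import csv
import io
import json

# Hypothetical text mining output; field names and relation labels are assumed,
# not the real I2E schema. The last record is a deliberate duplicate.
SAMPLE_JSON = """
[
  {"source": "Brassica oleracea var. italica", "source_type": "Plant",
   "target": "sulfur atom", "target_type": "Chemical", "relation": "CONTAINS"},
  {"source": "sulfur atom", "source_type": "Chemical",
   "target": "RAG1", "target_type": "Gene", "relation": "ASSOCIATED_WITH"},
  {"source": "sulfur atom", "source_type": "Chemical",
   "target": "RAG1", "target_type": "Gene", "relation": "ASSOCIATED_WITH"}
]
"""

def json_to_rows(json_text):
    """Return (node_rows, relationship_rows) with duplicates removed, order kept."""
    nodes = {}
    relationships = {}
    for hit in json.loads(json_text):
        # Dict keys preserve insertion order, so they act as ordered sets here.
        nodes[(hit["source"], hit["source_type"])] = None
        nodes[(hit["target"], hit["target_type"])] = None
        relationships[(hit["source"], hit["relation"], hit["target"])] = None
    return list(nodes), list(relationships)

def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

node_rows, rel_rows = json_to_rows(SAMPLE_JSON)
nodes_csv = to_csv(["name", "type"], node_rows)
rels_csv = to_csv(["source", "relation", "target"], rel_rows)
print(nodes_csv)
print(rels_csv)
```

Keeping nodes and relationships in separate CSV files is one common shape for bulk loading into a graph database. Removing exact duplicates at this stage does not replace manual curation of false positives.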
[Figure legend, node types: Disease, Pathway, Gene, Chemical, Plant]
Figure 3: Relationships between Brassica oleracea var. italica (broccoli), Sulfur atom, RAG1 gene, and diabetes mellitus. Graph resulted from a shortest path query between broccoli and diabetes mellitus.
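A shortest path query of this kind can be illustrated with a breadth-first search over a toy graph. The sketch below is not the query that was run against the graph database. The edge list is an assumption that chains the four entities in the order the caption names them, and `build_adjacency` and `shortest_path` are hypothetical helper names.

```python
from collections import deque

# Toy undirected graph; the edge order is assumed from the Figure 3 caption.
EDGES = [
    ("Brassica oleracea var. italica", "sulfur atom"),
    ("sulfur atom", "RAG1"),
    ("RAG1", "diabetes mellitus"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "Brassica oleracea var. italica", "diabetes mellitus")
print(" -> ".join(path))
```

Breadth-first search returns a path with the fewest relationships because every relationship is treated as having equal weight. Graph databases typically offer an equivalent built-in shortest path operation.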
Figure 1: Associations between diseases (Iron Overload & Amyloidosis), biological pathways, genes, chemicals, and plants. The query randomly selected two diseases and searched for all associated genes and chemicals.
Figure 2: Relationships between diseases and the pathways they correspond with, showing how various diseases (red) are related to broccoli. The query specified broccoli as the plant of interest and limited the number of diseases to those shown. Only the pathways, genes, and chemicals shared between entities are displayed.