Mastermind: An Automated Genomic Knowledge Harvesting and Data Prioritization Tool to Facilitate Analysis of Large-Scale Genomic Data

Genome sequence analysis requires identification of informative disease-gene-variant associations for extraction of clinico-biological meaning from patient data. The accuracy and efficiency of this analysis are limited by the inaccessibility and non-uniformity of genomic information. Here we describe MASTERMIND, a novel analytic tool that reduces the time and effort required to organize and integrate genomic information from any data source, including millions of full-text scientific articles and dozens of heterogeneous variant databases. For this work, curated lists of diseases and genes comprising 11.7K and 50.9K total entries, respectively, were used as initial query parameters. Custom-designed algorithms were used to generate comprehensive variant query lists comprising 602M total entries, sorted by biological outcome and used as second-tier queries. An automated querying architecture was designed using customized open-source analytics engines and a combination of publicly available and custom-developed APIs.

Using titles and abstracts of 26M primary articles, we identified 909K putative disease-gene associations (average 24 articles per association) that were then confirmed by scanning 3.3M full-text articles prioritized based on content. This information was then organized according to the strength of the association based on the total number and quality of individual citations and the position of disease-gene-variant keywords within the text. Integrated metadata for each finding was used to further prioritize disease-gene-variant associations in accordance with the abundance and quality of supporting evidence. These associations were then displayed with all relevant information from the primary source material used to drive prioritization including interactive access to annotated full-text articles. We have devised MASTERMIND to rapidly and automatically harvest genomic information from disparate data sources including full-text scientific articles and external databases of genomic variants. This tool rapidly and comprehensively interrogates, organizes and displays genome data and has promising applications in expediting tertiary analysis of human genome sequencing data in clinical assays of individual patients.

Mark J Kiel MD PhD (1), Nathan Patel (2), Richard W Peng (3), Steve A Schwartz (1)

(1) GENOMENON, Ann Arbor MI; (2) University of Michigan Medical School, Ann Arbor MI; (3) AlfaJango, Ann Arbor MI

Contact us at [email protected]

CHALLENGE

SOLUTION

MASTERMIND

Landing Page. Mastermind can be queried by searching either by disease to identify associated genes or by gene to identify all associated diseases. Synonyms for diseases and genes are recognized within the query window. Specific mutations within any given gene can also be queried to identify all articles or databases describing that specific mutation.

Overview Page. Every possible disease-gene combination is screened during assembly of Mastermind. The results are prioritized by the number of articles containing any specific disease-gene association. For the BRAF example shown, associations with melanoma, neuroectodermal tumors, gastrointestinal carcinoma and thyroid malignancy were readily identified.
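The article-count prioritization described above can be illustrated with a minimal sketch. The synonym tables, the substring matching and the function names below are assumptions made for illustration only; the poster does not describe how Mastermind implements this step.

```python
from collections import Counter
from itertools import product

# Hypothetical synonym tables: canonical name -> lowercase surface forms.
DISEASES = {
    "melanoma": {"melanoma", "malignant melanoma"},
    "thyroid cancer": {"thyroid cancer", "thyroid carcinoma"},
}
GENES = {"BRAF": {"braf", "b-raf"}, "KRAS": {"kras", "k-ras"}}


def mentioned(text, table):
    """Return canonical names whose synonyms occur in the text (naive substring match)."""
    lowered = text.lower()
    return {name for name, forms in table.items() if any(f in lowered for f in forms)}


def count_associations(abstracts):
    """Count, per (disease, gene) pair, the number of abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        for pair in product(mentioned(text, DISEASES), mentioned(text, GENES)):
            counts[pair] += 1
    return counts


# Toy abstracts, invented for this example.
abstracts = [
    "BRAF mutations are frequent in malignant melanoma.",
    "B-Raf inhibition in melanoma and the role of KRAS.",
    "Papillary thyroid carcinoma often harbours BRAF V600E.",
]
counts = count_associations(abstracts)
ranked = counts.most_common()  # pairs with the most supporting abstracts come first
```

Substring matching is deliberately crude here; a production system would need tokenization and disambiguation of gene symbols that collide with ordinary words.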

Association Page. For any disease-gene association, all relevant articles are displayed in a list prioritized according to the location of the keywords within the text. The landscape of articles containing all selected keywords is displayed by publication date and citation index. Each mutation identified in any of the articles containing the disease-gene association is displayed.
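One way to picture prioritization by keyword location is a weighted score per article. The section weights, the `Article` fields and the scoring rule in this sketch are hypothetical; the poster states only that keyword location drives the ordering, not how it is weighted.

```python
from dataclasses import dataclass

# Hypothetical weights: a hit in the title counts more than one in the abstract or body.
SECTION_WEIGHTS = {"title": 5.0, "abstract": 3.0, "body": 1.0}


@dataclass
class Article:
    ident: str  # placeholder identifier, not a real PubMed ID
    title: str
    abstract: str
    body: str


def location_score(article, keywords):
    """Sum section weights over every (section, keyword) pair where the keyword occurs."""
    score = 0.0
    for section, weight in SECTION_WEIGHTS.items():
        text = getattr(article, section).lower()
        score += weight * sum(1 for kw in keywords if kw.lower() in text)
    return score


def rank(articles, keywords):
    """Order articles from highest to lowest location score."""
    return sorted(articles, key=lambda a: location_score(a, keywords), reverse=True)


a = Article("A", "BRAF V600E in melanoma", "", "")
b = Article("B", "Kinase signalling review", "We discuss BRAF signalling.",
            "Melanoma cohorts were included.")
ordered = rank([b, a], ["BRAF", "melanoma"])
```

With these toy inputs, article A scores higher because both keywords appear in its title, whereas article B mentions them only in the abstract and body.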

Detail Page. Every mention of any disease or gene, or of a corresponding synonym, is recognized and highlighted for easy identification of high-yield information within the text. Mutations are identified and highlighted whether described as nucleotide-level or protein-level changes. Sentence fragments mentioning any specific mutation are extracted and displayed.
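The extraction and highlighting of sentences that mention a mutation might look roughly like the following. The sentence splitter, the `[[...]]` highlight markers and the function names are illustrative assumptions, not the tool's actual behavior.

```python
import re


def sentences_mentioning(text, term):
    """Return the sentences of `text` that contain `term` (case-insensitive)."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    # "p.V600E" is not split because no whitespace follows the period.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if term.lower() in s.lower()]


def highlight(sentence, term):
    """Wrap each occurrence of `term` in [[...]] markers, keeping the original casing."""
    return re.sub(re.escape(term), lambda m: f"[[{m.group(0)}]]",
                  sentence, flags=re.IGNORECASE)


text = ("BRAF is a kinase. The p.V600E mutation activates signalling. "
        "Other variants are rare.")
hits = sentences_mentioning(text, "p.V600E")
marked = [highlight(s, "p.V600E") for s in hits]
```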

[Figure: workflow diagram. Recoverable labels: Diverse Cohorts, Inexpensive Sequencing, Improved Outcomes; Publications, Content, Curation; PubMed Titles/Abstracts, Disease List, Gene List, Genetic Key Terms, Prioritized PDF Download, PDF Conversion, Article Full-Text, Comprehensive Variant List, Key Terms (Diagnostic, Therapeutic, Functional, Custom, etc.), Disease-Gene Associations, Disease-Variant Associations, Quality Control + Natural Language Processing; Millions of Publications, User-Defined Searches. Example entries: Melanoma; BRAF p.V600E, p.V600K, p.L597Q; KRAS p.G12V, p.G12S.]

Database Assembly. Titles and abstracts of every article in PubMed are scanned to determine whether a disease, a gene, or a key term relevant to clinical genetics is mentioned. The PDFs of these articles are downloaded and the full text is converted to searchable text. The full text is then scanned using the disease and gene lists described above. When a gene is identified in the full text, a variant search is invoked to identify any variant in that gene mentioned in the text in any way that an author may describe it, using either Human Genome Variation Society (HGVS) nomenclature or any of dozens of non-standard formats. Scans for additional key terms that may indicate that the data in the article are useful for making clinical decisions (such as diagnosis, prognosis or therapy selection) are performed to further categorize and prioritize the data. Once the variant and key-term scans are completed, disease-gene and disease-variant association data are passed through a quality control process and natural language processing to organize and store the data in the final database. This process is repeated weekly to remain up to date.
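To make the nomenclature problem concrete, the sketch below normalizes a few common ways of writing a protein-level substitution (for example V600E, p.V600E and p.Val600Glu) to a single form. It uses a regular expression as a simple stand-in; the poster describes lookup against a database of hypothetical variants rather than pattern matching, so the pattern and function names here are assumptions for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# One residue: a three-letter code (tried first) or a one-letter code.
AA = "(?:" + "|".join(THREE_TO_ONE) + "|[" + ONE_LETTER + "])"
# Optional "p." prefix, reference residue, position, alternate residue.
PATTERN = re.compile(r"\b(?:p\.)?(" + AA + r")(\d+)(" + AA + r")\b")


def to_one_letter(code):
    """Map a three-letter code to one letter; one-letter codes pass through."""
    return THREE_TO_ONE.get(code, code)


def find_protein_variants(text):
    """Return substitutions found in `text`, normalized to the p.<ref><pos><alt> form."""
    return [
        f"p.{to_one_letter(ref)}{pos}{to_one_letter(alt)}"
        for ref, pos, alt in PATTERN.findall(text)
    ]


sample = "The BRAF V600E mutation (p.Val600Glu) and KRAS p.G12V were detected."
found = find_protein_variants(sample)
```

A pattern this loose also matches unrelated tokens with the same shape (the histone name H2A, for instance), which is one reason restricting the search to variants of a gene already found in the article, as the text describes, is attractive.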

Process Overview. Mastermind is updated continually as new articles are published or new diseases are requested. Automated data processing identifies every mention of a disease-gene or disease-gene-variant association. Users can then search by disease, gene and/or variant to identify the most relevant content.

Content Curation. The challenge of interrogating this volume of information is further complicated by the complexity of the data itself and by the relatively infrequent use of standardized nomenclature to describe genetic variants and their associations with clinical scenarios, which can themselves be described in a variety of ways. Moreover, the associations between genetic biomarkers and diseases or other clinical phenomena are often difficult to codify, necessitating close examination of the primary evidence itself.

Translate genomic knowledge from the medical literature into clinical insight to drive diagnostic decisions

Identification of meaningful content based on scans of the titles and abstracts

Variant detection using custom informatics technique and comprehensive database of all hypothetical genetic variants produced by in silico mutagenesis

Annotation of disease-gene-variant association data and organization into clinically-meaningful categories

Cloud-based software providing access to a comprehensive database of millions of disease-gene and disease-gene-variant associations systematically extracted from millions of full-text scientific articles

Clinical Need. Increasingly diverse cohorts of patients stand to benefit from next-generation sequencing assays that may lead to higher diagnostic rates, better prognostic assessment, and improved outcomes, especially as more genetic biomarkers are linked with targeted therapeutics. With dramatic decreases in the cost of data production, the major bottleneck to making maximal use of this technology in clinical practice is now the accurate and efficient interpretation of data.

Databases. Comprehensive disease lists were developed using Medical Subject Heading (MeSH) terms. Comprehensive gene lists with synonyms were developed using the HUGO Gene Nomenclature Committee (HGNC) database of human gene names. Ancillary lists of useful clinical category terms were custom-developed.

Content (3M). An automated process was developed on a custom informatics framework to identify prioritized content: titles and abstracts containing disease or gene names are scanned through the eutils PubMed API, and matching articles are then downloaded and scanned in full text. This resulted in an initial corpus of 3.3M full-text articles.
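As a rough illustration of a title/abstract query against the E-utilities interface, the sketch below builds an `esearch` request URL for one disease-gene pair. The query construction, the helper name and the default `retmax` are assumptions; the poster does not specify which E-utilities calls or search terms were used, and the network request itself is left out.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(disease, gene, retmax=20):
    """Build an esearch URL for articles naming both terms in the title or abstract."""
    # [Title/Abstract] is PubMed's field tag for restricting a term to those fields.
    term = f'"{disease}"[Title/Abstract] AND "{gene}"[Title/Abstract]'
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("melanoma", "BRAF")
# Decode the query string again to confirm the term survives URL encoding.
query = parse_qs(urlparse(url).query)
```

Fetching the URL would return a JSON document listing matching PubMed identifiers, which a pipeline like the one described could then use to decide which full-text articles to retrieve.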

Hypothetical Variants (602M). Hypothetical variants comprise the backbone of the database used to scan every word of each article to identify variants at the cDNA and protein level.
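The idea of generating hypothetical variants by in silico mutagenesis can be sketched at the protein level as follows. This toy version enumerates every single-residue substitution of a sequence; whether Mastermind restricts itself to substitutions reachable by a single nucleotide change, and how it handles cDNA-level, insertion or deletion variants, is not stated in the poster, so treat the scope and naming here as assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def missense_variants(protein_seq):
    """Yield every single-residue substitution as p.<ref><pos><alt> (1-based positions)."""
    for pos, ref in enumerate(protein_seq, start=1):
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield f"p.{ref}{pos}{alt}"


toy_sequence = "MAV"  # invented three-residue peptide, not a real protein
variants = list(missense_variants(toy_sequence))
```

Each residue contributes 19 substitutions, so the list grows linearly with protein length; applying the same enumeration across all genes and variant types is what makes a query list on the scale described plausible.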

Disease-Gene Associations (909K). Disease-gene associations were identified using this data processing architecture, along with any associated variants within each of the identified genes.

Content Coverage. Initial focus on oncology has identified 1.4M cancer-associated articles describing tumor suppressors and oncogenes, the vast majority of which have been full-text processed.

Comparison with COSMIC. Preliminary comparison with the Catalogue of Somatic Mutations in Cancer (COSMIC) has demonstrated enhanced sensitivity of variant identification and 5 to 20 times more citations.

Interested in learning more? www.genomenon.com

26M Articles in PubMed. One of the most significant barriers to routinizing the interpretation of genetic data is that the information needed to adequately interpret these data is contained within decades of medical knowledge dispersed across many millions of primary scientific articles.

Automated full-text download and data processing to identify genes and diseases (and associated synonyms)