greene bosc2008

BOSC 2008 Lightning Talk:

The Enteropathogen Resource Integration Center (ERIC), a NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Disease

D. Pot¹, J. Whitmore¹, M. Shaker¹, J. Fedorko¹, K. Joshi¹, S. Nanan¹, P. Shetty¹, J. Thangiah¹, S. Zaremba¹, G. Plunkett III², J. Glasner², B. Anderson², D. Baumler², B. Biehl², V. Burland², E. Cabot², E. Neeno-Eckwall², B. Mau², P. Liss², M. Rusch², F. R. Blattner², N. T. Perna², J. M. Greene¹

¹SRA International, Inc., Rockville MD and ²University of Wisconsin, Madison WI

http://www.ericbrc.org/

• ERIC is a NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Disease, one of 8 such centers funded in July 2004 for 5 years.

• ERIC primarily focuses on the integration of data from five enteropathogens as well as related reference organisms:
  – Diarrheagenic E. coli
  – Shigella spp.
  – Salmonella spp.
  – Yersinia enterocolitica
  – Yersinia pestis

• Partnership between personnel at the Genome Center of Wisconsin (Nicole Perna, Fred Blattner, Guy Plunkett) and SRA International’s Global Health Sector, Rockville MD.

• Everything done under contract funding is required to be made freely available to the scientific community.

ERIC-Overview

Genomes Annotations (ASAP)

Genome Views and Comparisons (Mauve, GBrowse)

Microarray Analysis (mAdb)

ERIC is a portal-based system built on the JBoss portal. ASAP (A Systematic Annotation Package for community annotation) from UW-Madison is used to let the scientific community annotate genes for the five enteropathogens and for the related reference organisms useful for comparative genomics.

ERIC Portal Home Page

• ERIC contains tools for comparative genomics, such as Mauve, which has the distinct advantage of allowing comparison of more than two genomes, as well as being able to handle chromosomal rearrangements. (We provide access to some other pathogenic and non-pathogenic reference genomes, particularly for E. coli.)

Comparative Genome Analysis in Mauve


Mauve – whole genome comparison

• Mauve identifies and aligns regions of local collinearity called locally collinear blocks (LCBs). Each locally collinear block is a homologous region of sequence shared by two or more of the genomes under study, and does not contain any rearrangements of homologous sequence.

• The Mauve genome alignment procedure results in a global alignment of each locally collinear block that has sequence elements conserved among all the genomes under study. Nucleotides in any given genome are aligned only once to other genomes, suggesting orthology among aligned residues. Mauve makes no attempt to align paralogous regions.

• The remaining unaligned regions may be lineage-specific sequence or rearranged or paralogous repetitive regions and can be identified as such during subsequent processing with other tools.

• Available at: http://gel.ahabs.wisc.edu/mauve/download.php
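To make the idea of a locally collinear block concrete, here is a toy sketch for two genomes. It is not Mauve's algorithm or code; it only illustrates the definition above. It assumes a hypothetical list of matching anchors (the `Anchor` type, its coordinates, and the function name are invented for illustration) and splits them into maximal runs that keep the same order and orientation in both genomes, so that no run contains a rearrangement.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """A hypothetical matching region shared by genome A and genome B."""
    pos_a: int   # start coordinate in genome A
    pos_b: int   # start coordinate in genome B
    strand: int  # +1 if B matches in the same orientation, -1 if reversed


def collinear_blocks(anchors: List[Anchor]) -> List[List[Anchor]]:
    """Partition anchors into maximal runs with no internal rearrangement."""
    if not anchors:
        return []
    by_a = sorted(anchors, key=lambda a: a.pos_a)
    # Rank of every anchor when the anchors are ordered along genome B.
    rank_b = {a: i for i, a in enumerate(sorted(anchors, key=lambda a: a.pos_b))}

    blocks = [[by_a[0]]]
    for prev, cur in zip(by_a, by_a[1:]):
        same_strand = prev.strand == cur.strand
        # A forward run advances one rank in B; a reversed run steps back one.
        adjacent_in_b = rank_b[cur] - rank_b[prev] == prev.strand
        if same_strand and adjacent_in_b:
            blocks[-1].append(cur)   # still collinear: extend the current block
        else:
            blocks.append([cur])     # breakpoint: start a new block
    return blocks


# Illustrative data only: the middle two anchors are inverted in genome B.
example = [
    Anchor(100, 100, +1),
    Anchor(200, 200, +1),
    Anchor(300, 900, -1),
    Anchor(400, 800, -1),
    Anchor(500, 500, +1),
]
blocks = collinear_blocks(example)
print([len(block) for block in blocks])
```

In this example the inversion breaks the anchors into three blocks. Handling more than two genomes, anchor weighting, and the alignment inside each block are all outside the scope of this sketch.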

Mauve

• SRA is an industry leader in natural language processing (NLP)-based text mining

• Dedicated group of linguists and software engineers

• Routinely win Government text mining competitions (e.g. the Message Understanding Conferences (MUC))

• Extensive experience in multilingual information extraction, text clustering, and text summarization – this is not just keyword searching.

• Numerous commercial and government clients/applications: health care organizations (fraud detection); financial services (anti-money laundering, e-mail surveillance); government (homeland security, e-Government, business intelligence)

• See Poster S04, Extraction of Facts and Relationships Relevant to Molecular Mechanisms of Bacterial Pathogenesis through Natural Language Processing, for details on how this is used in ERIC!

Text Mining

• Latest Articles tab – we present the mined extracts on enteropathogens for the preceding week.

Text Mining – Current Awareness

• Search tab – allows users to search across extracted data

• Unlike Latest Articles, this is not limited to our contract enteropathogens, and should be useful across all bacteria.

Text Mining - Search

• Currently, in addition to processing all new PubMed abstracts weekly, we are extracting about 4,000-5,000 older abstracts per night, and have extracted all PubMed abstracts going back about four years.

• We intend to go back at least 10 years.

• There is no reason this cannot be applied to Open Access full-length text; we will also provide Web Services access to the extracted data in the near future.

• Search Results:

Text Mining – Search Results

Extracted Terms and Relationships (requirements from Biologists)

Frequency of Terms and Coloration Control (requirements from IT Types)

Extracted Text from PubMed
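As a rough illustration of searching across extracted terms and relationships and tallying term frequency, here is a minimal sketch. The record layout (`Extract`), the function names, and the sample rows are all assumptions made for this example; they do not reflect the ERIC database schema or NetOwl's output format, and the PubMed identifiers are placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Extract:
    """Hypothetical record for one relationship mined from an abstract."""
    pmid: str      # placeholder identifier of the source abstract
    subject: str   # first extracted term
    relation: str  # extracted relationship between the two terms
    obj: str       # second extracted term


def search(extracts: Iterable[Extract], term: str) -> List[Extract]:
    """Return records whose subject or object mentions the term (case-insensitive)."""
    needle = term.lower()
    return [e for e in extracts
            if needle in e.subject.lower() or needle in e.obj.lower()]


def term_frequency(extracts: Iterable[Extract]) -> Counter:
    """Count how often each extracted term appears across the given records."""
    counts: Counter = Counter()
    for e in extracts:
        counts.update([e.subject, e.obj])
    return counts


# Illustrative rows only, not real mined output.
sample = [
    Extract("PMID-A", "Shigella flexneri", "expresses", "ipaB"),
    Extract("PMID-B", "Shigella flexneri", "expresses", "ipaC"),
    Extract("PMID-C", "Yersinia pestis", "expresses", "yopH"),
]
hits = search(sample, "shigella")
freq = term_frequency(hits)
print(len(hits), freq.most_common(1))
```

A display layer could then use counts like these to rank or color terms in the results view; how ERIC actually does this is not described here.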

What is different?

Before

• Hundreds of abstracts to read (days)

• Limited, keyword searching

• Data handling complex (stacks of paper)

• Slower ability to reach conclusions

Now

• Quick summary provided (seconds)

• Enhanced role searching

• Knowledge base with links to details

• Faster conclusions through mining of extracted data

Other ERIC Notes

• Again, everything we do under the contract must be made freely available to the scientific community – all SRA’s work is available under the MIT License, and components from UW are under the GNU GPL.

• Posters - Monday evening session:
  – I-04 on ERIC System (John Greene)
  – S-04 on Text Mining (David Pot)

• For more information, contact [email protected].

• ERIC is supported via NIAID contract HSN266200400040C.


NetOwl Extractor: Optimizing Manual Literature Annotation

[Figure: system diagram. Labels: Scientist; ERIC-BRC System; ERIC Data Warehouse; ERIC Portal; Text Mining Tools; Text Mining Services; Other Tools; NetOwl Extractor; NetOwl® Algorithm; NetOwl Config; ERIC Custom Config; BRC-Central; PubMed & Other Document Sources; Pattern Writers (ontologies and patterns developed to mine text). Connection labels: Web Service, HTTP, Input, Results.]

*Software licensed for use on ERIC is bounded in red.
