greene bosc2008
TRANSCRIPT
BOSC 2008 Lightning Talk:
The Enteropathogen Resource Integration Center (ERIC), A NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Disease
http://www.ericbrc.org
D. Pot1, J. Whitmore1, M. Shaker1, J. Fedorko1, K. Joshi1, S. Nanan1, P. Shetty1, J. Thangiah1, S. Zaremba1, G. Plunkett, III2, J. Glasner2, B. Anderson2, D. Baumler2, B. Biehl2, V. Burland2, E. Cabot2,
E. Neeno-Eckwall2, B. Mau2, P. Liss2, M. Rusch2, F. R. Blattner2, N. T. Perna2, J. M. Greene1
1SRA International, Inc., Rockville MD and 2University of Wisconsin, Madison WI
• ERIC is a NIAID Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Disease, one of 8 such centers funded in July 2004 for 5 years.
• ERIC primarily focuses on the integration of data from five enteropathogens as well as related reference organisms:
– Diarrheagenic E. coli
– Shigella spp.
– Salmonella spp.
– Yersinia enterocolitica
– Yersinia pestis
• Partnership between personnel at the Genome Center of Wisconsin (Nicole Perna, Fred Blattner, Guy Plunkett) and SRA International’s Global Health Sector, Rockville MD.
• Everything done under contract funding is required to be made freely available to the scientific community.
ERIC-Overview
Genomes Annotations (ASAP)
Genome Views and Comparisons (Mauve, GBrowse)
Microarray Analysis (mAdb)
ERIC is a portal-based system built on the JBoss portal. ASAP (A Systematic Annotation Package for community annotation) from UW-Madison is being used to allow the scientific community to annotate genes for the five enteropathogens and for related reference organisms useful for comparative genomics.
ERIC Portal Home Page
• ERIC contains tools for comparative genomics, such as Mauve, which has the distinct advantage of allowing comparison of more than two genomes, as well as being able to handle chromosomal rearrangements. (We provide access to some other pathogenic and non-pathogenic reference genomes, particularly for E. coli.)
Comparative Genome Analysis in Mauve
Mauve – whole genome comparison
• Mauve identifies and aligns regions of local collinearity called locally collinear blocks (LCBs). Each locally collinear block is a homologous region of sequence shared by two or more of the genomes under study, and does not contain any rearrangements of homologous sequence.
• The Mauve genome alignment procedure results in a global alignment of each locally collinear block that has sequence elements conserved among all the genomes under study. Nucleotides in any given genome are aligned only once to other genomes, suggesting orthology among aligned residues. Mauve makes no attempt to align paralogous regions.
• The remaining unaligned regions may be lineage-specific sequence or rearranged or paralogous repetitive regions and can be identified as such during subsequent processing with other tools.
• Available at: http://gel.ahabs.wisc.edu/mauve/download.php
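To make the idea of a locally collinear block concrete, here is a minimal toy sketch in Python. It is not Mauve's algorithm: it assumes we already have a list of anchor matches between two genomes, all on the forward strand, and it simply groups anchors that keep the same relative order in both genomes. The function name `collinear_blocks` and the anchor coordinates are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical anchor: (position in genome A, position in genome B) of a shared match.
Anchor = Tuple[int, int]


def collinear_blocks(anchors: List[Anchor]) -> List[List[Anchor]]:
    """Group anchors into maximal runs that are adjacent and in the same
    order in both genomes (forward strand only, toy illustration)."""
    if not anchors:
        return []

    # Order anchors by their position in genome A.
    ordered = sorted(anchors)

    # Compute the rank of each anchor by its position in genome B.
    by_b = sorted(range(len(ordered)), key=lambda i: ordered[i][1])
    rank_b = [0] * len(ordered)
    for rank, idx in enumerate(by_b):
        rank_b[idx] = rank

    # Neighbouring anchors in A stay in one block only if they are also
    # neighbours, in the same order, in B; otherwise a rearrangement
    # breakpoint separates them and a new block starts.
    blocks = [[ordered[0]]]
    for i in range(1, len(ordered)):
        if rank_b[i] == rank_b[i - 1] + 1:
            blocks[-1].append(ordered[i])
        else:
            blocks.append([ordered[i]])
    return blocks


# Made-up coordinates: the last two anchors sit upstream in genome B,
# as if a segment had been translocated.
example = [(100, 5000), (200, 5100), (300, 5200), (400, 1000), (500, 1100)]
blocks = collinear_blocks(example)
for block in blocks:
    print(block)
```

In this toy input the first three anchors form one block and the last two form a second block, because the segment they cover has moved relative to the first in genome B. Handling inversions, more than two genomes, and weak spurious anchors is where the real tool does its work.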
Mauve
• SRA is an industry leader in natural language processing (NLP)-based text mining
• Dedicated group of linguists and software engineers
• Routinely win Government text mining competitions (e.g. Message Understanding Competitions (MUC))
• Extensive experience in multilingual information extraction, text clustering, and text summarization – this is not just keyword searching.
• Numerous commercial and government clients/applications: health care organizations (fraud detection); financial services (anti-money laundering, e-mail surveillance); government (homeland security, e-Government, business intelligence)
• See Poster S04, Extraction of Facts and Relationships Relevant to Molecular Mechanisms of Bacterial Pathogenesis through Natural Language Processing, for details on how this is used in ERIC!
Text Mining
• Latest Articles tab – we present the mined extracts on enteropathogens for the preceding week.
Text Mining – Current Awareness
• Search Tab – allows users to search across extracted data
• Unlike Latest Articles, this is not limited to our contract enteropathogens, and should be useful across all bacteria.
Text Mining - Search
• Currently, in addition to processing all new PubMed abstracts weekly, we are extracting about 4,000-5,000 abstracts per night, and have extracted all PubMed abstracts going back about four years.
• We intend to go back at least 10 years…
• There is no reason this cannot be applied to Open Access full-length text; we will also provide Web Services access to the extracted data in the near future…
• Search Results:
Text Mining – Search Results
Extracted Terms and Relationships (requirements from Biologists)
Frequency of Terms and Coloration Control (requirements from IT Types)
Extracted Text from PubMed
What is different?
Before:
• Hundreds of abstracts to read (days)
• Limited, keyword searching
• Data handling complex (stacks of paper)
• Slower ability to reach conclusions
Now:
• Quick summary provided (seconds)
• Enhanced role searching
• Knowledge base with links to details
• Faster conclusions through mining of extracted data
Other ERIC Notes
• Again, everything we do under the contract must be made freely available to the scientific community – all SRA’s work is available under the MIT License, and components from UW are under the GNU GPL.
• Posters - Monday evening session:
– I-04 on ERIC System (John Greene)
– S-04 on Text Mining (David Pot)
• For more information, contact [email protected].
• ERIC is supported via NIAID contract HSN266200400040C.
http://www.ericbrc.org
NetOwl Extractor: Optimizing Manual Literature Annotation
[Diagram; recoverable labels: Scientist; ERIC-BRC System; ERIC Datawarehouse; ERIC Portal; Text Mining Tools; Text Mining Services; Other Tools; NetOwl Extractor; NetOwl® Algorithm; NetOwl Config; ERIC Custom Config; BRC-Central; Web Service; HTTP; Results; PubMed & Other Document Sources; Input; Pattern Writers; Ontologies and patterns developed to mine text]
*Software licensed for use on ERIC bounded in red.