an introduction to the biomedical informatics research network (birn) gully apc burns information...

An Introduction to the Biomedical Informatics

Research Network (BIRN)Gully APC Burns

Information Sciences InstituteUniversity of Southern California

This Presentation• The BIRN Vision• Project History• Communities • Specific Capabilities

– Data Management – BIRN Pathology– Information Integration – BIRN Mediator– Knowledge Engineering – ‘KEfED’

• http://www.birncommunity.org

The Purpose of BIRN“to advance biomedical research

through data sharing and online collaboration”

Challenges … Opportunities

What is the main challenge we face?• Not just size of datasets …• Not just communication bandwidth …

The challenge is heterogeneity.• In 2010, NIH funded 31,232 RO1 grants• How do we provide practical data sharing

and collaboration solutions at that scale?

Use Cases and Capabilities

Focus on the things our users wantDevelop Capabilities – i.e., our ability to:

– Publish data / Deploy tools– Rapidly develop new lightweight systems

tailored for specialized use cases– Reconcile data across disparate sources – Provide security (credentials, encryption)– Develop and enable scientific reasoning

systems

Background and History• Funded by National Center for Research Resources in 2001.

• Three neuroimaging testbeds initially provided use cases for developed technology.

• Reorganized in 2008 to provide software-based solutions for data sharing in the biosciences. No longer focused on neuroimaging (or even imaging).

• One can no longer ‘log into BIRN’. New model of providing modular capabilities that users can fit into existing toolset.

Sites and Collaborators

Technical Approach

• Bottom up, not top down• Focus on user requirements - what

they want to do• Create solutions - factor out common

requirements– Capability model includes software and

process– Avoid “Big Design up Front” (BDUF)

Dissemination Models• Different problems require different kinds of

solutions.– BIRN operates services for all users; e.g., user

registration service– BIRN provides kits for project teams to deploy

services for their members; e.g., data sharing– BIRN provides downloadable tools for individuals to

use on their own; e.g., pathology databases, integration systems, reasoning systems, workflows.

• Understanding deployment needs is part of defining the problem to be solved.

Working Groups• New capabilities are defined,

designed, and disseminated by BIRN Working Groups– Operations– Security– Data Management – Information Integration – Knowledge Engineering– Workflow Tools – Genomics

• Projects are vetted by BIRN Steering Committee for appropriateness and resource planning

• Collaborators are expected to be actively involved in development and deployment.

• Collaborators are not expected to use complete BIRN capability kit but should pick and choose to suit their own needs.

Working with BIRN

Some of our Communities …• FBIRN

– Schizophrenic neuroimaging, fMRI data• Nonhuman Primates Research

Consortium – Colony management, pathology, genetics

• Cardiovascular Research Grid• Kinetics Foundation

– Parkinson’s disease therapeutics… and others

Data Management• BIRN supports data

intensive activities including:– Imaging, Microscopy,

Genomics, Time Series, Analytics and more…

• BIRN utilities scale:– TB data sets– 100 MB – 2 GB Files– Millions of Files

Data Service Deployment

BIRN PATHOLOGYProject Overview

BIRN Pathology Technology ‘Stack’

LAMP

Database Development

Virtual Microscop

ySearch

Application (e.g., Pathology)Data Integration

Security

File TransferData Management

“Capabilities”Other BIRN

“Capabilities” 3rd Party Tools Application Specific

key

Database Development

• In a nutshell:• Django + Extensions

• Features• Django Web Framework

• MVC, ORM, HTML Templates• Auto-generated data forms!

• RESTful interface routing• Spreadsheet batch importer• Crowd integration (users/groups/SSO)• Data object access control (ACLs)• Javascript UI widgets• Scheduled database backup

• System Requirements• Linux, Apache, MySQL or Postgres,

Python, Django

Use Cases

Scientists need to create custom web-accessible databases to share data within and between research groups.

Scientists store data on spreadsheets, need to manage this data (DBMS), and expose it programmatically (WS REST)

Virtual Microscopy

• Image Processing Tools• Accepts images in numerous

proprietary microscopy formats (Bio-Formats)

• Creates pyramidal, tiled image sets in open format (JPEG) used by Zoomify image viewer

• Extracts proprietary graphical annotation formats and converts them to Zoomify annotation format

• Image Server + Database• (Built on our Database Development

tools)• Monitors an image upload directory,

automatically processes images, and updates the image database

• Stores and serves annotations according to associated ACLs

Use Cases

Scientists need to share high resolution (100X) images over relatively low bandwidth networks.

Scientists need to annotate and share annotations on images.

Search• Features

• Single-site, full-text indexing of databases and text files.

• Search results returned as Python objects based on Django ORM mappings.

• ‘Advanced search’; i.e., multiple fields, and logical expressions.

• Easy to synchronize index with changes to Django-based object model.

• Technology• Based on Xapian and Djapian

Use Case

Scientists need to search their databases using full-text queries and field-based “advanced” search queries.

Pathology Application

• Features• Pathology Data Model

• Subject, Case, Specimen, Tissue, Image entities• Species, Disease, Etiology domain tables

• Pathology data entry workflows• UI forms tailored to the data entry workflow of

pathologists.• Microscopy support for proprietary

formats:• Aperio SVS (supported)• Hammamatsu NDPI (partially supported)• Olympus VSI (under development)

• And, of course, all of the features from the underlying software and services• Database Development + Virtual Microscopy +

Search• Data Integration using BIRN Mediator

• Capability to run federated queries across sites

Use Cases

Pathologists create images in proprietary microscopy formats, annotate the images, and need to upload them to a web-accessible database for sharing among collaborating sites.

Pathologists associate metadata and supplementary files with their microscopy images.

Screenshots

Information Integration• BIRN supports integration

across complex data sources– Can process wide variety of

structured & semi-structured sources (DBMS, XML, HTML, Excel, XML, SOAP

• Use Schema & Data– Source Modeling & Record

Linkage• Infrastructure capabilities

– Security, Efficient Query Execution, SQL-like syntax across multiple sources.

Decision Support

Application Programs

Mediator

KnowledgeBases

Databases Computer Programs

The Web

Information Mediator• Virtual Integration Architecture:

– Virtual organization: community of data providers and consumers that want to share data for a specific purpose

– Autonomous sources: data, control remains at sources; no change to access methods, schemas; data accessed real-time in response to user queries

– Mediator: integrator defines domain schema and describes source contents

• Domain schema: agreed upon view of the domain preferred by the virtual organization

• Source descriptions: logical formulas relating source and domain schemas

BIRN MEDIATORProject Overview

Information Mediator• Query Answering

– User writes query in domain schema– Mediator:

• Determines sources relevant to user query• Rewrites query in sources schemas• Breaks query into sub-queries for sources• Optimizes query evaluation plan• Combines answers from sources

– Efficient query evaluation• Streaming dataflow

The Information MediatorUser Queries / Web Portal / Services

Security

Information Integration

‘Capabilities’Other BIRN ‘Capabilities’ Application Specific

key

Execution Engine

Optimizer

Reformulation

Wrapper WrapperWrapper

Logical SourceDescriptions

Data Sources

Database Consolidation

• Construct a ‘virtual organization’ • A community of data providers and consumers

sharing data for common specific purpose

• All sources are autonomous • data, control remains at sources• no change to access methods, schemas; • data accessed real-time in response to user

queries

• Work consists of modeling domain schema and source contents• Domain schema = agreed upon view of the

domain preferred by the virtual organization• Source descriptions = logical formulas relating

source and domain schemas

• Implemented in multiple domains: • fMRI : Ashish, et al. (2010) “Neuroscience Data

Integration through Mediation: An (F)BIRN Case Study.” Front. Neuroinf. 4:118

• Cardiovascular Research Grid• Non-Human Primate Research Consortium• Child Neurodevelopmental Disorders

Use Cases

Scientists from different groups want to query across two databases with different schema

Databases may be completely different (i.e., one group uses Excel spreadsheets, another uses Filemaker Pro and a third uses Oracle)

Database Extension

• Reapply the mediator technology to sources from different subjects• e.g., link genetics data to imaging data.

• Not dependent on a universal, global ontology, but a locally-defined model specific for the application

• Develop models using Datalog that are then processed by the BIRN Mediator, and OGSA-DAI / OGSA-DQP technology.

Use Cases

Scientists want to bring together data from different sources into a a single, common domain model

Sources will require linkage at the level of schema and data

EZ-Mediator

• This is not a particular difficult task to manage by hand but automating it is even simpler within our framework • Small improvements can lead to

large gains!• Based on the same mediator technology as previously described

Use Cases

Can we, with the absolute minimum of effort, run queries across separate databases that have the same schema?

Screenshots

Knowledge Engineering

Scientific assertions as ‘Computable, citable

elements’• There are very large number of statements like ‘mice like

cheese’– semantics at this level are complicated!

• For example:– “Novel neurotrophic factor CDNF protects midbrain dopamine

neurons in vivo” [Lindholm et al 2007]– “Hippocampo-hypothalamic connections: origin in subicular

cortex, not ammon's horn.” [Swanson & Cowan 1975]– “Intravenous 2-deoxy-D-glucose injection rapidly elevates levels

of the phosphorylated forms of p44/42 mitogen-activated protein kinases (extracellularly regulated kinases 1/2) in rat hypothalamic parvicellular paraventricular neurons.” [Khan & Watts 2004]

• Assertions vary in their levels of reliability, specificity.• Can we introduce a generalized formalism that could support

automated reasoning?

Cycles of Scientific Investigation (‘CoSI’)

e.g., ‘CDNF protects nigral dopaminergic neurons in-vivo’

This statistically-significant effect is the experimental basis for the findings of this study.

Our ontology engineering approach is based on experimental variables

from Lindholm, P. et al. (2007), Nature, 448(7149): p. 73-7

Knowledge Engineering from

Experimental Design (‘KEfED’)

Khan et al. (2007), J. Neurosci. 27:7344-60 [expt 2]

KNOWLEDGE ENGINEERING FROM EXPERIMENTAL DESIGN‘KEfED’Project Overview

BioScholarApplication

• Develop a knowledge base framework for observations and interpretations from experiments.

• Scientists manually curate data by hand from publications into generic database driven by KEfED model

• Can reuse designs for multiple experiments

• Design process is intuitive, can build a database without informatics training• Ideal for non-computational biologists.• Java / Flex Web application, one click

install

Use Cases

Scientists want to develop a generic knowledge base driven from a corpus of PDF files stored locally within a specific laboratory

CruxApplication

• Scientists within a disease foundation must plot a whole research program

• How to keep track of hypotheses, experimental results and outcomes to plan the next phase of the project?

• System is just about to start year 2 of funding geared towards curation of raw data (not from publications).

• System can permit scientists with no computational expertise develop scientific data repositories.

• Project driven by the Kinetics Foundation (with initial funding from M.J. Fox Foundation).

Use Cases

Decision makers at a disease foundation want to store raw data generated a generic knowledge base driven from a corpus of PDF files stored locally within a specific laboratory

Screenshots

Summary• BIRN provides capability kits and

consulting to facilitate internal and external data sharing and other aspects of collaboration.

• BIRN takes a bottom-up, modular, capability-based approach that is driven by end user needs.

• End users are viewed as collaborators and are actively involved in development process.

BIRN Program Announcements

• Provides funds to work with BIRN on projects relating to data sharing (PAR-07-426) and the associated ontology (PAR-07-425) for the data.

• Mechanism to fund the leveraging of BIRN capabilities

• BIRN provides assistance in framing proposals and developing projects.

Contact InformationGeneral:

– [email protected] Project Manager:

– Joe Ames: [email protected] Outreach:

– Karl Helmer: [email protected]:

– http://www.birncommunity.org – https://wiki.birncommunity.org:8443/

Acknowledgements• Executive Team

– Carl Kesselman– Joe Ames– Karl Helmer

• Data Management– Ann Chervenak– Rob Schuler– David Smith– Laura Pearlman

• Information Integration– Jose Luis Ambite– Gowri Gumaraguruparan– Maria Muslea– Naveen Ashish

• Knowledge Engineering– Jessica Turner– Tom Russ

• Security– Rachana Ananthakrishnan

• Communities– NonHuman Primate

Research Consortium– FBIRN – CVRG– Mouse BIRN– Mouse Genome Informatics – Brain Architecture Group @

USC

And Many Many More…

an introduction to the biomedical informatics research network (birn) gully apc burns information...

Documents