RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework

Post on 12-Nov-2014


DESCRIPTION

Research Data Access and Preservation Summit, 2014. San Diego, CA, March 26-28, 2014. Maryann Martone, Principal Investigator, Neuroscience Information Framework, University of California, San Diego

TRANSCRIPT

The Neuroscience Information Framework

Maryann E. Martone, Ph.D., University of California, San Diego

We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration

Whole brain data (20 µm microscopic MRI)

Mosaic LM images (1 GB+)

Conventional LM images

Individual cell morphologies

EM volumes & reconstructions

Solved molecular structures

No single technology serves these all equally well.

Multiple data types; multiple scales; multiple databases

A data integration problem

• NIF is an initiative of the NIH Blueprint consortium of institutes
  – What types of resources (data, tools, materials, services) are available to the neuroscience community?
  – How many are there?
  – What domains do they cover? What domains do they not cover?
  – Where are they?
    • Web sites
    • Databases
    • Literature
    • Supplementary material
    • PDF files
    • Desk drawers
  – Who uses them?
  – Who creates them?
  – How can we find them?
  – How can we make them better in the future?

http://neuinfo.org

How many resources are there?

• NIF Registry: A catalog of neuroscience-relevant resources
• > 10,000 currently listed
• > 2500 databases
• And we are finding more every day

June 10, 2013

But we have Google!

• Current web is designed to share documents
  – Documents are unstructured data

• Much of the content of digital resources is part of the “hidden web”

• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

Which databases do you use?

• Mouse Genome Database
• Allen Brain Atlas
• ClinicalTrials.gov
• PubMed
• dbGAP
• GEO
• NIH Reporter
• OMIM
• Bionumbers
  – a database of numerical values extracted from literature
• Epigenomics
  – human epigenomic data to catalyze basic biology and disease-oriented research
• Antibody Registry
  – 2M antibodies
• BioGrid
  – an interaction repository of protein and genetic interactions

Most resources are largely unknown and underutilized

NIF: A New Type of Entity for New Modes of Scientific Dissemination

• NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
  – NIF unites neuroscience information without respect to domain, funding agency, institute or community
  – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
  – Makes them searchable from a single interface
  – Practical and cost-effective; tries to be sensible
  – Learned a lot about effective data sharing

How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Site map available for local hosting

NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr

Two tiered system: low barrier to entry

NIF searches across 3 main indices: Registry, Federation and Literature

• Data Federation: 200 databases / 400M records
• Registry: 6300 resources (2500 databases)
• Literature: 22 million articles

What resources are available for GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

NIF makes it easier to browse different databases

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

• Query expansion: Synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source

Making it easier to access and understand distributed databases
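The query expansion shown in the search example, where a term is ORed with its synonyms, can be sketched in a few lines. This is a minimal illustration and not NIF's implementation: the function `expand_query` and the `SYNONYMS` table are hypothetical, and the only synonym entries are the ones that appear in the slide.

```python
# Hypothetical synonym table; the entries come from the slide's example query.
SYNONYMS = {"hippocampus": ["Cornu Ammonis", "Ammon's horn"]}


def expand_query(term, synonyms):
    """Return a Boolean OR query over a term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so each is searched as a phrase.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus", SYNONYMS))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is returned unchanged, so the expansion never narrows the original query.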

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

NIF Semantic Framework: NIFSTD ontology

• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]

NIF capitalizes on the growing set of community ontologies available in biomedical science

NIF Concept Mapper: Reducing false positives

Is there a framework for neuroscience?

• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
  – Organism
  – Anatomical structure
  – Cell
  – Molecule
  – Function
  – Dysfunction
  – Technique
• When NIF combines multiple sources, a set of common fields emerges
  – Basic information models/semantic models exist for certain types of entities

Biomedical science does have a conceptual framework
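As a rough illustration of what mapping source columns onto the seven core categories could look like, here is a keyword-based sketch. The slides do not say how NIF performs this mapping, so everything here is an assumption made for the example: the keyword lists, the function `core_category`, and the sample column names are all hypothetical.

```python
# Hypothetical keyword lists for the seven core categories named in the slide.
# "Dysfunction" is checked before "Function" because the word "dysfunction"
# contains "function".
CATEGORY_KEYWORDS = {
    "Organism": ("organism", "species"),
    "Anatomical structure": ("region", "anatom"),
    "Cell": ("cell",),
    "Molecule": ("gene", "protein", "molecule"),
    "Dysfunction": ("dysfunction", "disease", "disorder"),
    "Function": ("function",),
    "Technique": ("technique", "method"),
}


def core_category(column_name):
    """Return the first core category whose keyword occurs in the column name."""
    name = column_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return None  # column does not map to a core category


columns = ["species", "brain_region", "gene_symbol", "record_id"]
mapped = {column: core_category(column) for column in columns}
print(mapped)
```

Columns that return `None` correspond to the part of the column inventory that does not fall under any core category.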

[Figure: Purkinje Cell, Axon Terminal, Axon, Dendritic Tree, Dendritic Spine, Dendrite, Cell body, Cerebellar cortex]

Bringing knowledge to data: Ontologies as framework

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

CNeurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

• Incorporate basic neuroscience knowledge into search
  – Google: searches for string “GABAergic neuron”
  – NIF automatically searches for types of GABAergic neurons

Types of GABAergic neurons

NIF Concept-Based Search
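The difference between string search and concept-based search can be sketched as a walk over a subclass hierarchy: the query term is expanded to every more specific type before searching. This is an illustrative sketch, not NIF's code. The `SUBCLASSES` table is a toy stand-in for an ontology such as NIFSTD, and `expand_concept` is a hypothetical helper.

```python
from collections import deque

# Toy subclass table (assumption): parent concept -> direct subtypes.
SUBCLASSES = {
    "GABAergic neuron": ["Purkinje cell", "Basket cell"],
    "Basket cell": ["Cerebellar basket cell"],
}


def expand_concept(term, subclasses):
    """Return the term followed by all of its subtypes, breadth first."""
    found = [term]
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in subclasses.get(current, []):
            if child not in found:  # guard against repeats and cycles
                found.append(child)
                queue.append(child)
    return found


print(expand_concept("GABAergic neuron", SUBCLASSES))
# prints: ['GABAergic neuron', 'Purkinje cell', 'Basket cell', 'Cerebellar basket cell']
```

A plain string search would match only the first entry of that list; searching for every entry retrieves records annotated with a specific cell type that never mention the phrase "GABAergic neuron".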

Neuroscience Information Framework – http://neuinfo.org

Ontologies as a data integration framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
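The exact-match and synonym-match counts above come from comparing the term lists of the connectivity databases. A minimal sketch of that kind of comparison follows, with invented data: the source names `db_a`, `db_b`, `db_c`, their term lists, and both helper functions are hypothetical, and partonomy matching is not covered.

```python
from collections import Counter


def normalize(term, synonyms):
    """Lowercase a term and map it to its preferred label, if one is known."""
    key = term.strip().lower()
    return synonyms.get(key, key)


def terms_in_multiple_sources(sources, synonyms=None):
    """Return the sorted terms that occur in more than one source."""
    synonyms = synonyms or {}
    counts = Counter()
    for terms in sources.values():
        # Use a set so each source counts a term at most once.
        counts.update({normalize(t, synonyms) for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


# Invented example data, not taken from the real databases.
SOURCES = {
    "db_a": ["Hippocampus", "Amygdala"],
    "db_b": ["Ammon's horn", "amygdala"],
    "db_c": ["Thalamus"],
}
SYNONYMS = {"ammon's horn": "hippocampus"}

print(terms_in_multiple_sources(SOURCES))            # exact matches only
print(terms_in_multiple_sources(SOURCES, SYNONYMS))  # exact plus synonym matches
```

Adding the synonym table increases the number of shared terms, which mirrors the jump from exact matches to synonym matches reported in the slide.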

[Figure legend: 0, 1-10, 11-100, >101]

Open World-Closed World: Mapping the knowledge - data space

Data Sources

NIF lets us ask: where isn’t there data? What isn’t studied? Why?

[Figure: Forebrain, Midbrain, Hindbrain; legend: 0, 1-10, 11-100, >101]

The data space is not uniform

Data Sources

“The Data Homunculus”

Funding drives representation in the data space

What can we learn from the NIF Registry?

NIF supports a semantic model for describing research resources


Resource Curation


• NIF Registry is hosted on Semantic MediaWiki platform Neurolex
  – Community can add, review, edit without special privileges
  – Searchable by Google
  – Integrated with NIF ontologies
  – Graph structure

http://neurolex.org

Can we mine relationships between resources?

http://neuinfo.org

NIF semantic graph of research resources

Text mining gives a picture of the most used resources

PDB

http://force11.org/Resource_identification_initiative

• Automated text mining is used to look for “web page last updated” or copyright dates
  – Identified for 570 resources
  – 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
  – 38 not updated within the past 2 years (~20%)
  – 8 migrated to new addresses or institutions
  – 7 are no longer in service (~3%)
  – 3 were deemed no longer appropriate
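The automated check described above, looking for a "last updated" or copyright date and flagging resources not updated within two years, can be approximated with two regular expressions. This is a sketch under stated assumptions, not the pipeline NIF runs: the marker phrases, the function names, and the year-level granularity are all choices made for the example.

```python
import re

# Lines that mention an update or copyright notice (assumed marker phrases).
MARKER_RE = re.compile(r"last\s+(?:updated|modified)|copyright|©", re.IGNORECASE)
# Four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_stated_year(page_text):
    """Return the most recent year found on a marker line, or None."""
    years = []
    for line in page_text.splitlines():
        if MARKER_RE.search(line):
            years.extend(int(y) for y in YEAR_RE.findall(line))
    return max(years) if years else None


def is_stale(page_text, current_year, max_age_years=2):
    """True if the stated year is more than max_age_years old; None if no date is found."""
    year = latest_stated_year(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years


page = "Welcome to the lab database\nLast updated: March 3, 2011\n"
print(latest_stated_year(page))  # prints: 2011
print(is_stale(page, 2014))      # prints: True
```

Pages with no detectable date return `None` rather than being counted as stale, which is the kind of case a manual review would have to cover.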

Tracking digital resources since 2008

NIF helps stabilize the dynamic resource landscape

Keeping content up to date

Connectome

Tractography

Epigenetics

• New tags come into existence
• New resource types come into existence, e.g., Mobile apps
• Resources add new types of content
  • Change name
  • Change scope
• > 7000 updates to the registry last year

It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review

What can we learn from the NIF Data Federation?

NIF supports a semantic model for describing research resources

dkCOIN Investigator's Retreat

[Figure: Data Federation Growth, Jun-08 to May-13; axes: Number of Federated Records (Millions), Number of Federated Databases]

NIF searches the largest collation of neuroscience-relevant data on the web

DISCO


What do you mean by data?

Databases come in many shapes and sizes

• Primary data:
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

What have we learned: Grabbing the long tail of small data

• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
  – Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
  – Makes it difficult to learn from combined efforts

NIF is trying to make it easier to work with diverse data

Phases of NIF

• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
  – NIF Registry vs NIF data federation
  – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
  – Effective search across semantically diverse sources
    • NIFSTD ontologies
• 2009-2011: Strategy for data integration
  – Unified views across common sources
  – Mapping of content to NIF vocabularies
• 2011-present: Data analytics
  – Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services

NIF provides a strategy and set of tools applicable to all biomedical science

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11
