Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework

Maryann Martone, Ph.D., University of California, San Diego

Post on 11-May-2015

DESCRIPTION

Maryann Martone - Keynote at Bernstein Computational Neuroscience conference, Munich, 2012

TRANSCRIPT

Page 1: Big data from small data:  A deep survey of the neuroscience landscape data via

Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework

Maryann Martone, Ph.D., University of California, San Diego

Page 2

“Neural Choreography”

“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits-- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....

However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”

Akil et al., Science, Feb 11, 2011

Page 3

“Data choreography”

In that same issue of Science:

• Asked peer reviewers from last year about the availability and use of data
• About half of those polled store their data only in their laboratories, not an ideal long-term solution.
• Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
• And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.

“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649)

Page 4

Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are

• Whole brain data (20 um microscopic MRI)
• Mosaic LM images (1 GB+)
• Conventional LM images
• Individual cell morphologies
• EM volumes & reconstructions
• Solved molecular structures

No single technology serves these all equally well.

Multiple data types; multiple scales; multiple databases

A data federation problem

Page 5

NIF is an initiative of the NIH Blueprint consortium of institutes

• What types of resources (data, tools, materials, services) are available to the neuroscience community?
• How many are there?
• What domains do they cover? What domains do they not cover?
• Where are they? Web sites; databases; literature; supplementary material
• Who uses them?
• Who creates them?
• How can we find them?
• How can we make them better in the future?

http://neuinfo.org

• PDF files

• Desk drawers

Page 6

We need more databases (?)

•NIF Registry: A catalog of neuroscience-relevant resources

• > 5000 currently listed

• > 2000 databases

•And we are finding more every day

Page 7

But we have Google!

• Current web is designed to share documents
• Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”

Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

Page 8

NIF must work with the ecosystem as it is today

• NIF has developed a production technology platform for researchers to discover, share, access, analyze, and integrate neuroscience-relevant information
• Semantically-enabled search engine and interface that customizes results for neuroscience
• System that searches the “hidden web”, i.e., content not well served by search engines
• Data resources are predominantly relational, XML, text, RDF, OWL
• Automated data harvesting technologies that produce dynamic indices of data content including databases, web pages, text, XML, etc.
• Tools to make products and data available
• Designed to be populated rapidly; set up process for progressive refinement

Page 9

UCSD, Yale, Cal Tech, George Mason, Washington Univ

NIF accomplishments

Assembled the largest searchable collation of neuroscience data on the web

• Data federation
• Resource registry (materials, data, tools, services)
• PubMed literature
• Full text of open access

The largest ontology for neuroscience

NIF search portal: simultaneous search over data, NIF catalog and biomedical literature

Neurolex Wiki: a community wiki serving neuroscience concepts

A unique technology platform

A reservoir of cross-disciplinary biomedical data expertise

NIF is poised to capitalize on the new tools and emphasis on big data and open science

Page 10

NIF data federation

[Figure: two pie charts of the percentage of data records per data type, one including microarray (98%) and one showing everything but microarray. Legend labels: Images, Drugs, Antibodies, Grants, Pathways, Animals, Connectivity, Brain activation foci, Microarray.]

> 180 sources; 350 M records: NIF was designed to be populated rapidly, with progressive refinement of data

Page 11

What do you mean by data?

Databases come in many shapes and sizes

• Primary data: Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data: Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: Metadata; pointers to data sets or materials stored elsewhere
• Data aggregators: Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

Page 12

What types of questions can I ask?

We’d like to be able to find:

What is known:

• What is the average diameter of a Purkinje neuron?
• Is GRM1 expressed in cerebral cortex?
• What are the projections of hippocampus?
• What genes have been found to be upregulated in chronic drug abuse in adults?
• Is there a database of fMRI studies?
• What studies used my polyclonal antibody against GABA in humans?
• What rat strains have been used most extensively in research during the last 20 years?

What is not known:

• Connections among data
• Gaps in knowledge

Without some sort of framework, very difficult to do

Page 13

What are the connections of the hippocampus?

Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”

• Query expansion: Synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source
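
The query expansion on this slide can be illustrated with a minimal sketch. This is not NIF's implementation: the synonym table and the function names are hypothetical, and a real system would presumably draw synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str) -> str:
    """Return the term OR-ed with any synonyms known for it."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    return " OR ".join(quote(v) for v in variants)


print(expand_query("Hippocampus"))
```

A term with no known synonyms passes through unchanged, so the expansion can be applied to every query.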

Page 14

Results are organized within a common framework

Connects to

Synapsed with

Synapsed by

Input region

innervates

Axon innervates

Projects to

Cellular contact

Subcellular contact

Source site

Target site

Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases

Page 15

The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
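
One way such counts could be produced is sketched below, under stated assumptions. The database names come from the slide, but the term lists and the synonym table are invented for illustration, and `shared_terms` is a hypothetical helper rather than anything NIF is known to use.

```python
from collections import Counter

# Toy term lists; the contents are invented, not taken from the real databases.
terms_by_db = {
    "BAMS": {"hippocampus", "caudate nucleus", "CA1"},
    "CoCoMac": {"hippocampus", "area 17"},
    "Connectome Wiki": {"Ammon's horn", "primary visual cortex"},
}

# Hypothetical synonym table mapping variant labels to one preferred label.
preferred = {"Ammon's horn": "hippocampus", "area 17": "primary visual cortex"}


def shared_terms(by_db, normalize=lambda t: t):
    """Return the terms that occur in more than one database after normalization."""
    counts = Counter()
    for terms in by_db.values():
        counts.update({normalize(t) for t in terms})
    return {t for t, n in counts.items() if n > 1}


exact = shared_terms(terms_by_db)
with_synonyms = shared_terms(terms_by_db, lambda t: preferred.get(t, t))
print(sorted(exact))
print(sorted(with_synonyms))
```

In this toy data, exact string matching finds one shared term, while normalizing through synonyms finds two, which is the kind of gain the slide's numbers point to.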

Page 16

NIF’s minimum requirements for effective data sharing

• You (and the machine) have to be able to find it
  • Accessible through the web
  • Annotations
• You have to be able to use it
  • Data type specified and in a usable form
• You have to know what the data mean
  • Some semantics
  • Context: Experimental metadata
  • Provenance: Where did the data come from?

Reporting neuroscience data within a consistent framework helps enormously

Page 17

What is an ontology?

[Diagram: Brain has a Cerebellum; Cerebellum has a Purkinje Cell Layer; Purkinje Cell Layer has a Purkinje cell; Purkinje cell is a neuron]

Ontology: an explicit, formal representation of concepts and relationships among them within a particular domain that expresses human knowledge in a machine readable form

Branch of philosophy: a theory of what is

e.g., Gene ontologies
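
A minimal sketch of what "machine readable" can mean here, assuming the diagram labels on this slide read as a chain from Brain down to Purkinje cell. The relation names `has_a` and `is_a` and the `follow` helper are illustrative choices, not a standard ontology format.

```python
# Edges assumed from the slide's diagram, stored as (subject, relation, object).
EDGES = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def follow(start, relation):
    """Collect everything reachable from start by repeatedly following one relation."""
    found, frontier = [], [start]
    while frontier:
        node = frontier.pop()
        for subject, rel, obj in EDGES:
            if subject == node and rel == relation and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


print(follow("Brain", "has_a"))
print(follow("Purkinje cell", "is_a"))
```

Because the relations are typed, a program can answer "what does the brain contain?" separately from "what kind of thing is a Purkinje cell?".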

Page 18

“Ontology as mathematics, computer science or Esperanto” - Andrey Rzhetsky and James A. Evans

You need to use ontology identifiers instead of strings

Blah, blah, ontology blah

Page 19

What can ontology do for us?

• Express neuroscience concepts in a way that is machine readable
  • Classes are identified by unique identifiers
  • Synonyms, lexical variants
  • Definitions
• Provide means of disambiguation of strings
  • Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
• Provide universals for navigating across different data sources
  • Semantic “index”
• Perform reasoning
• Link data through relationships, not just one-to-one mappings
• “Concept-based queries”

“Esperanto!”
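
Two of these points, disambiguation by identifier and classes defined by rules, can be sketched as follows. The `EX:` identifiers, the record layout and the function names are all hypothetical; only the "nucleus" example and the GABAergic neuron rule come from the slide.

```python
# Hypothetical identifiers: one string label, three distinct concepts.
CONCEPTS = {
    "EX:0001": {"label": "nucleus", "part_of": "cell"},
    "EX:0002": {"label": "nucleus", "part_of": "brain"},
    "EX:0003": {"label": "nucleus", "part_of": "atom"},
}


def lookup(label):
    """Return every identifier whose label matches the given string."""
    return sorted(cid for cid, c in CONCEPTS.items() if c["label"] == label)


def is_gabaergic_neuron(cell):
    """Rule from the slide: a neuron that releases GABA as a neurotransmitter."""
    return cell["type"] == "neuron" and "GABA" in cell["releases"]


# Illustrative cell records; the field names are assumptions.
purkinje = {"type": "neuron", "releases": {"GABA"}}
granule = {"type": "neuron", "releases": {"glutamate"}}

print(lookup("nucleus"))
print(is_gabaergic_neuron(purkinje), is_gabaergic_neuron(granule))
```

The string "nucleus" alone is ambiguous, while each identifier is not; the rule lets a program classify a cell without anyone having labeled it "GABAergic" by hand.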

Page 20

Power of unique identifiers: Are you the M Martone who...

The Gene Wiki: community intelligence applied to human gene annotation. Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9.

Ontologies for Neuroscience: What are they and What are they Good for? Larson SD, Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1.

Three-dimensional electron microscopy reveals new details of membrane systems for Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst MJ, Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13.

Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-Apr;36(2):3.

Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900.

Some analyses of forgetting of pictorial material in amnesic and demented patients. Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78.

Page 21

I am not a number (but I should be)

• Full URI: Uniform Resource Identifier http://orcid.org/1234567
• Label: Maryann Elizabeth Martone
• Synonym: ME Martone, M Martone, Maryann
• Abbreviation: MEM
• Is a
• Has a
• Is that entity which has these properties

[Diagram labels: M Martone; Dept of Psychiatry, UCSD; Nelson Butters; Publications; Boston VA Hospital; Female]

Text mining algorithms can discover a lot of things about me

ORCID project: Author ID’s

Page 22

NIF Semantic Framework: NIFSTD ontology

NIF covers multiple structural scales and domains of relevance to neuroscience

Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks for more complex representations

[Diagram: NIFSTD modules: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure]

Page 23

“We studied the behavior of CA2-binding proteins in Ca2 neurons under high and low Ca2 conditions”

BioGrid; Allen Brain Atlas; Brain Info

NIF queries across 170+ independent databases

Page 24

But you don’t have what I need!

http://neurolex.org Stephen Larson/INCF

• Provide a simple framework for defining the concepts required
  • Cell, Part of brain, subcellular structure, molecule
• Community based:
  • Communities contribute their vocabularies
  • Reconcile and align concepts used by different domains

•Each concept gets its own unique identifier

•Creating a computable index for neuroscience data

• INCF

Demo D03

Page 25

Concept-based search: search by meaning

Search Google: GABAergic neuron Search NIF: GABAergic neuron

NIF automatically searches for types of GABAergic neurons

Types of GABAergic neurons
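
The behavior described on this slide, searching for a concept and automatically including its subtypes, can be sketched as below. The is-a table is a small hypothetical example and is not taken from NIFSTD; the function names are likewise assumptions.

```python
# Hypothetical is-a table (child -> parent), for illustration only.
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "Cerebellar basket cell": "GABAergic neuron",
    "GABAergic neuron": "neuron",
    "Pyramidal cell": "neuron",
}


def subtypes(concept):
    """Return all direct and indirect subtypes of a concept."""
    direct = [child for child, parent in IS_A.items() if parent == concept]
    result = list(direct)
    for child in direct:
        result.extend(subtypes(child))
    return result


def expanded_terms(concept):
    """Return the concept followed by its subtypes, ready to be OR-ed in a search."""
    return [concept] + sorted(subtypes(concept))


print(expanded_terms("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would miss records that only mention "Purkinje cell"; the expanded term list does not.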

Page 26

Esperanto!

“The trouble is that if I make up all of my own URIs, my [data] has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two [data sets] with no URIs in common have no information that can be interrelated.”

• NIF favors reuse of identifiers rather than mapping
• NIF imports many ontologies
• Creating ontologies to be used as common building blocks: modularity, low semantic overhead, is important
• Many community ontologies available covering multiple domains
• NIFSTD available via web services
• Bioportal (http://bioportal.bioontology.org/)

http://www.rdfabout.com/intro/#Introducing%20RDF

Page 27

NIF Analytics: The Neuroscience Ecosystem

NIF is in a unique position to answer questions about the neuroscience ecosystem

Where are the data?

[Figure: data plotted by brain region and data source. Axis labels: Brain region; Data source. Labeled regions: Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain.]

Vadim Astakhov, Kepler Workflow Engine

Page 28

Whither neuroscience information?

• What is easily machine processable and accessible
• What is potentially knowable
• What is known: Literature, images, human knowledge
• Unstructured; Natural language processing, entity recognition, image processing and analysis; communication

Page 29

Open world meets closed world

Query for “reference” brain structures and their parts in NIF Connectivity database

But...NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records

Page 30

NIF Reports: Male vs Female

Gender bias

NIF can start to answer interesting questions about neuroscience research, not just about neuroscience

Page 31

What have we learned: Grabbing the long tail of small data

Analysis of NIF shows multiple databases with similar scope and content

Many contain partially overlapping data

Data “flows” from one resource to the next

Data is reinterpreted, reanalyzed or added to

Is duplication good or bad?

Page 32

Embracing duplication: Data Mash ups

• NIF queries across 3 of approximately 10 fMRI databases
• ~300 PMID’s were common between Brede and SUMSdb
• PMID serves as a unique identifier for an article
• Same information; value added

Same data; different aspects
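
Because a PMID identifies an article uniquely, finding the overlap between two sources reduces to a set intersection, and records about the same article can be merged on that key. The sketch below uses invented PMIDs and invented field names; it does not reflect the actual Brede or SUMSdb schemas.

```python
# Invented PMIDs for illustration; real lists would come from each database.
brede_pmids = {"10000001", "10000002", "10000003"}
sumsdb_pmids = {"10000002", "10000003", "10000004"}

common = brede_pmids & sumsdb_pmids

# Hypothetical per-source records keyed by PMID, each describing a different aspect.
brede = {"10000002": {"foci_count": 12}}
sumsdb = {"10000002": {"surface_map": "available"}}


def merge(pmid):
    """Combine what each source records about the same article."""
    return {"pmid": pmid, **brede.get(pmid, {}), **sumsdb.get(pmid, {})}


print(sorted(common))
print(merge("10000002"))
```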

Page 33

Same data: different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Analysis:

• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment

Chronic vs acute morphine in striatum
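
The arithmetic behind "over half" and the baseline issue can be made concrete with a short sketch. The two counts are from the slide; the expression values and the `same_direction` helper are invented for illustration, assuming each database reports a signed change relative to its own baseline.

```python
# Counts reported on the slide.
total_statements = 1370
consistent = 617

# Statements not found consistent with DRG, and their share of the total.
not_confirmed = total_statements - consistent
share = round(100 * not_confirmed / total_statements, 1)
print(not_confirmed, share)


def same_direction(gemma_change, drg_change):
    """Flip Gemma's sign so both values mean 'morphine relative to saline', then compare."""
    return (-gemma_change > 0) == (drg_change > 0)


# Invented values: Gemma's sign is relative to chronic morphine, DRG's to saline.
print(same_direction(gemma_change=-1.5, drg_change=2.0))
print(same_direction(gemma_change=1.5, drg_change=2.0))
```

The remainder of 753 statements is about 55% of the total, which matches "over half". The sign flip shows why raw values from the two databases cannot be compared until they share a baseline.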

Page 34

Taking a global view on data: microculture to ecosystem

Several powerful trends should change the way we think about our data: One to many

• Many data
  • Generation of data is getting easier; shared data
  • Data space is getting richer: more -omes every day
  • But...compared to the biological space, still sparse
• Many eyes
  • Wisdom of crowds
  • More than one way to interpret data
• Many algorithms
  • Not a single way to analyze data
• Many analytics
  • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting

Are you exposing or burying your work?

Page 35

The future of scientific communication

• We have learned over the years how to write a scientific paper for other humans to read and for other agents to index
  • We now have to learn how to write papers for automated agents (and their humans) to mine
• We have learned over the years to report data in papers for humans to read
  • We now have to learn how to publish data in a form and on a suitable platform for automated agents (and their humans) to mine

Reporting neuroscience data within a consistent framework helps enormously

Printing press

Linked data cloud

Watson

Page 36

Why does it matter?

47/50 major preclinical published cancer studies could not be replicated

“The scientific community assumes that the claims in a preclinical study can be taken at face value -- that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”

Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data

Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531

“There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”

Data, not just stories about them!

Page 37

[Diagram with numbered steps 1 to 4. Labels: “How do I share my data?”; “There is no database for my data”; Register your resource to NIF!; Community database: beginning; Community database: end; Institutional repositories; Cloud; INCF: Global infrastructure; Government; Education; Industry]

NIF is designed to leverage existing investments in resources and infrastructure

Page 38

It’s a messy ecosystem (and that’s OK)

NIF favors a hybrid, tiered, federated system

• Domain knowledge: Ontologies
• Claims about results: Virtuoso RDF triples
• Data: Data federation; Workflows
• Narrative: Full text access

[Diagram labels: Neuron; Brain part; Disease; Organism; Gene]

• Caudate projects to Snpc
• Grm1 is upregulated in chronic cocaine
• Betz cells degenerate in ALS
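
The three claims on this slide fit the subject, predicate, object shape of a triple. The sketch below stores them as plain Python tuples with a wildcard pattern match; it is only an illustration of the idea and does not use Virtuoso, RDF URIs or SPARQL. The predicate names are assumptions.

```python
# Claims from the slide as (subject, predicate, object) tuples.
TRIPLES = [
    ("Caudate", "projects_to", "Snpc"),
    ("Grm1", "is_upregulated_in", "chronic cocaine"),
    ("Betz cell", "degenerates_in", "ALS"),
]


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(p="projects_to"))
print(match(s="Grm1"))
```

In a real triple store the subjects and objects would be ontology identifiers rather than strings, so that claims from different sources about the same entity line up.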

Page 39

Future of Research Communications and e-Scholarship

FORCE11: http://force11.org

Founded by Phil Bourne, Tim Clark, Ed Hovy, Anita de Waard and Ivan Herman

Bring together stakeholders with an interest in moving scholarly communication beyond reliance on papers and traditional impact metrics

Beyond the PDF 2: Spring 2013

Page 40

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum

Fahim Imam, NIF Ontology Engineer; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Lee Hornbrook; Binh Ngo; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer

Page 41

Why do we create so many overlapping products?

“That which I cannot build, I cannot understand”

Don’t trust any data you haven’t generated

Oh, now I see what you are saying

Scientists know the domain, not informatics

Science is incremental; we build on the results of others

It’s ingrained in our culture

“Build a better mousetrap and the world will beat down our doors”

Little credit for making someone else’s product better

Yes, we are planning to do that...

We are all time and resource constrained

We extend projects in time

There’s more than one way to skin a cat....

We are still mastering the medium

Technology is developing fast

Page 42

When I talk to resource providers, neuroscientists (and journal editors)...

You need to use ontology identifiers instead of strings

Blah, blah, ontology blah