use of data

Keynote presented at the Phenotype Foundation first annual meeting.

Amsterdam, January 18, 2016

Prof. Chris Evelo
Department of Bioinformatics – BiGCaT
Maastricht University
@Chris_Evelo

The use and needs of data sharing in biology

Data

• Things we know
• Things we measure

Knowledge is hard to get

And it doesn’t even play it…

But you can gamify collection

Since we structure it, it can be easier to store

Sharing Data

I would like to exploit common genotype-phenotype relations between Alzheimer’s Disease and Huntington’s Disease…

I need to combine AD and HD data…

I can help with that!


Source: Marco Roos

Who wants to share data?
• People who want to use data
• Funders
• Publishers
• But the researchers?

You only need MS-Excel

People hide data

• I did all this work, I want to reuse it
• They don’t need this part, might be my next…
• I might get a patent on this
• Or… It needs a patent to be valuable
• I can’t even patent because ...

How?

• Don’t add specifics (oh, those really were knockout cells, but...)

• Leave out important steps (I did these PCRs, why show the array)

• And “we used an approach slightly modified from…”

• ...

FAIR data

• Findable
• Accessible
• Interoperable
• Reusable

Sharing Data

I would like to exploit common genotype-phenotype relations between Alzheimer’s Disease and Huntington’s Disease…

I need to combine AD and HD data…



Source: Marco Roos

Sharing Data

Source: Marco Roos

???

Here’s my data, have fun!

Sharing Linkable Data

Source: Marco Roos

I can go straight to answering my questions with data from multiple data owners!

Patients will be so pleased with this speed-up!

Here’s my Linked Data, have fun!

Really?

From terms “liver, hepar, hepatic tissue”

To URIs:
http://identifiers.org/tissueont1/liver
http://identifiers.org/tissueont2/hepar
…
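The step from free-text terms to URIs can be sketched as a simple lookup. This is a minimal Python illustration under stated assumptions, not an existing tool: the `TERM_TO_URI` table and the `to_uri` helper are hypothetical names, and only the two illustrative URIs shown on the slide are included. The entry for "hepatic tissue" is left out because the slide does not give it.

```python
from typing import Optional

# Hypothetical lookup table; the two URIs are the illustrative ones from the slide.
TERM_TO_URI = {
    "liver": "http://identifiers.org/tissueont1/liver",
    "hepar": "http://identifiers.org/tissueont2/hepar",
}


def to_uri(term: str) -> Optional[str]:
    """Return the URI for a free-text term, or None if the term is unmapped."""
    return TERM_TO_URI.get(term.strip().lower())


if __name__ == "__main__":
    for term in ["Liver", "hepar", "hepatic tissue"]:
        print(term, "->", to_uri(term))
```

The two synonyms still resolve to two different URIs, so a lookup like this only removes spelling ambiguity; it does not yet say that the two concepts are related.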

Just a first step

http://identifiers.org/tissueont1/liver

http://identifiers.org/tissueont2/hepar

And we didn’t even get that…

Reality:

Ontology-inspired pull-down menus

Nothing is ever “same-as”

• We may need more meaningful predicates
• Or learn to use them better
• We need lenses, context matters
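One way to picture predicates and lenses is a mapping store in which every link carries a predicate, and a lens decides which predicates count as "equivalent" in a given context. The sketch below is an assumption-laden illustration: the `MAPPINGS` list, the `LENSES` table and the `equivalents` function are hypothetical, the SKOS predicate names are used only as labels, and the `skos:closeMatch` link between the two slide URIs is invented for the example.

```python
# Illustrative URIs taken from the slides.
LIVER = "http://identifiers.org/tissueont1/liver"
HEPAR = "http://identifiers.org/tissueont2/hepar"

# Hypothetical mapping store: (subject URI, predicate, object URI).
MAPPINGS = [
    (LIVER, "skos:closeMatch", HEPAR),  # assumed relation, for illustration only
]

# A lens names the predicates that count as "equivalent" in a given context.
LENSES = {
    "strict": {"skos:exactMatch"},
    "relaxed": {"skos:exactMatch", "skos:closeMatch"},
}


def equivalents(uri, lens):
    """Return the URIs treated as equivalent to `uri` under the named lens."""
    allowed = LENSES[lens]
    found = set()
    for subject, predicate, obj in MAPPINGS:
        if predicate not in allowed:
            continue
        # Mappings are followed in both directions.
        if subject == uri:
            found.add(obj)
        elif obj == uri:
            found.add(subject)
    return found


if __name__ == "__main__":
    print(equivalents(LIVER, "strict"))
    print(equivalents(LIVER, "relaxed"))
```

Under the strict lens the two terms stay separate; under the relaxed lens they are merged. The same data answers differently depending on the question being asked.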

Too many standards

Source: XKCD, https://xkcd.com/927/

Too many standards

And ontologies…

But they are there for a reason! Research fields have different focus/needs

Don’t standardise, map!

We need mapping

• Ontology mapping
• Identifier mapping
• Identity (text) mapping
• Chemistry mapping

We need mapping

• Ontology mapping: NCBO
• Identifier mapping: BridgeDb, IMS
• Identity (text) mapping: Conceptwiki?
• Chemistry mapping: CRS??
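Identifier mapping, in its simplest form, is a cross-reference table that translates an identifier from one database namespace to another. The sketch below is a toy in-memory version and deliberately does not use the BridgeDb or IMS APIs; `XREFS` and `map_identifier` are hypothetical names, and the TP53 identifiers are an illustrative example that does not come from the talk.

```python
# Toy cross-reference table; each row lists identifiers for one gene and its product.
# The TP53 identifiers are an illustrative example, not taken from the talk.
XREFS = [
    {"ensembl": "ENSG00000141510", "ncbigene": "7157", "uniprot": "P04637"},
]


def map_identifier(source_db, identifier, target_db):
    """Translate an identifier from one database namespace to another.

    Returns None when the identifier or the target namespace is unknown.
    """
    for row in XREFS:
        if row.get(source_db) == identifier:
            return row.get(target_db)
    return None


if __name__ == "__main__":
    print(map_identifier("ncbigene", "7157", "ensembl"))
```

A real mapping service additionally has to deal with one-to-many links and with identifiers that are retired or merged over time, which this sketch ignores.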

There is a lot out there

Discussed last Friday: Serum and adipose tissue amino acid homeostasis in the MHO (Badoud 2014)

– Objective: Integrate metabolite and gene expression profiling to elucidate the molecular distinctions between Metabolically Healthy Obese (MHO) and Metabolically Unhealthy Obese (MUO)

• Conclusion: SAT gene expression profiling revealed that genes related to branched-chain amino acid catabolism and the tricarboxylic acid cycle were less down-regulated in MHO individuals compared to MUO individuals. Together, this integrated analysis revealed that MHO individuals have an intermediate amino acid homeostasis compared to LH and MUO individuals.

– (Diabetes Risk Assessment study) 3 groups: Lean Healthy (LH), MHO and MUO

• Fasting serum samples from all participants and adipose tissue from the periumbilical region under local anesthesia after an overnight fast

– Initially 30 participants, 10 in each group (7 women, 3 men), but for the microarray analysis they analyzed SAT from 7 LH, 8 MHO and 8 MUO, each group having 2 men. Not very clear why -> They selected samples having an RNA integrity number higher than 8

– Gene expression data only for the 23 participants
– No gender or biological information (e.g. glucose, total triglycerides, etc.)
– No initial serum metabolite concentrations (only mean)
– dx.doi.org/10.1021/pr500416v
– Data can be found: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55200

Adding phenotypic data

Diversity, not size, makes big data hard

SAM module
• small assays
• diverse assays

For now annotation, used after you find it

Repositories are technology driven

• Expression data
• Protein data
• Metabolomics data
• Genetic variation data

Repositories are technology driven

• Expression data: ArrayExpress, GEO
• Protein data: PRIDE
• Metabolomics data: MetaboLights
• Genetic variation data: dbSNP

Start with the samples?

Or the studies?

ISA-tab inspired
investigations link to studies
which link to assays
samples
and the actual data
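The hierarchy on this slide can be sketched as nested records. This is a minimal illustration of the investigation, study, assay, sample and data chain as described here, not the ISA-tab file format or any ISA tooling; the class and field names, and the example titles and file name, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    name: str
    samples: List[str] = field(default_factory=list)     # sample names
    data_files: List[str] = field(default_factory=list)  # paths to the actual data


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def all_data_files(self) -> List[str]:
        """Walk investigation -> studies -> assays and collect the data files."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


if __name__ == "__main__":
    # Hypothetical example content.
    assay = Assay("expression", samples=["sample-1"], data_files=["sample-1.txt"])
    inv = Investigation("example investigation", [Study("example study", [assay])])
    print(inv.all_data_files())
```

Starting from the investigation makes it possible to reach every data file through the study description, which is the point of capturing studies rather than only depositing the data.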

Study capturing…

Capturing needs meta-ontologies
Examples: EFO (Experimental Factor Ontology), eNanoMapper (nanomaterials)

• Combine
• Map
• Slim
• Extend
• Feed extensions back to source
• Reproduce from (extended) source

If you can find it in a database

Can you find the database?

Discoverable fairports?

What about institute repos?

If study in dbNP

• Large data in repos (e.g. MetaboLights)
• Study descriptions still hidden

Combine with knowledge

• Can you find a study by the results?
• Integrate results (pathway and ontology profiles)

Challenges needed

Teams answering real questions

• Finds needs and solutions
• Combines across communities
• Fun! And inspiring
• Interesting, publishable results

Starting a database is easy

• What about sustainability?
• Core resources need:
– Long-term funding
– Regular monitoring
• Integration in communities

use of data

Science