use of data

Keynote presented at the Phenotype Foundation first annual meeting. Amsterdam, January 18, 2016 Prof. Chris Evelo Department Bioinformatics – BiGCaT Maastricht University @Chris_Evelo The use and needs of data sharing in biology


Page 1: Use of data

Keynote presented at the Phenotype Foundation first annual meeting.

Amsterdam, January 18, 2016

Prof. Chris Evelo
Department Bioinformatics – BiGCaT
Maastricht University
@Chris_Evelo

The use and needs of data sharing in biology

Page 2: Use of data

Data

• Things we know
• Things we measure

Page 3: Use of data

Knowledge is hard to get

And it doesn’t even play it…

But you can gamify collection

Since we structure it, it can be easier to store

Page 4: Use of data

Sharing Data

I would like to exploit common genotype-phenotype relations between Alzheimer’s Disease and Huntington’s Disease…

I need to combine AD and HD data…

I can help with that!

I can help with that!

Source: Marco Roos

Page 5: Use of data

Who wants to share data?
• People who want to use data
• Funders
• Publishers
• But the researchers?

Page 6: Use of data

You only need MS-Excel

Page 7: Use of data

People hide data

• I did all this work I want to reuse
• They don’t need this part, might be my next…
• I might get a patent on this
• Or… It needs a patent to be valuable
• I can’t even patent because ...

Page 8: Use of data

How?

• Don’t add specifics (ohh those really were knockout cells, but..)

• Leave out important steps (I did these PCRs, why show the array)

• And “we used an approach slightly modified from…”

• ...

Page 9: Use of data

FAIR data

• Findable
• Accessible
• Interoperable
• Reusable

Page 10: Use of data

Sharing Data

I would like to exploit common genotype-phenotype relations between Alzheimer’s Disease and Huntington’s Disease…

I need to combine AD and HD data…

I can help with that!

I can help with that!

Source: Marco Roos

Page 11: Use of data

Sharing Data

Source: Marco Roos

???

Here’s my data, have fun!

Here’s my data, have fun!

Page 12: Use of data

Sharing Linkable Data

Source: Marco Roos

I can go straight to answering my questions with data from multiple data owners!

Patients will be so pleased with this speed-up!

Here’s my Linked Data, have fun!

Here’s my Linked Data, have fun!

Page 13: Use of data

Really?

From terms “liver, hepar, hepatic tissue”

To URIs:
http://identifiers.org/tissueont1/liver
http://identifiers.org/tissueont2/hepar
….

Just a first step
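A minimal sketch of this first step, assuming a small hand-made synonym table. The table, the function name `term_to_uri`, and the decision about which synonym points at which URI are illustrative assumptions; only the two URIs come from the slide.

```python
# Hypothetical synonym table: free-text tissue terms mapped to the two
# example URIs from the slide. A real table would be derived from an ontology.
TERM_TO_URI = {
    "liver": "http://identifiers.org/tissueont1/liver",
    "hepatic tissue": "http://identifiers.org/tissueont1/liver",
    "hepar": "http://identifiers.org/tissueont2/hepar",
}


def term_to_uri(term):
    """Return the URI for a free-text term, or None if the term is unknown."""
    key = " ".join(term.lower().split())  # normalise case and whitespace
    return TERM_TO_URI.get(key)


if __name__ == "__main__":
    for raw in ["Liver", "  hepatic   tissue ", "hepar", "kidney"]:
        print(repr(raw), "->", term_to_uri(raw))
```

Even in this toy table the three synonyms end up on two different URIs, so the URIs themselves still have to be mapped to each other.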

Page 14: Use of data

And we didn’t even get that…

Reality:

Ontology inspired pull-down menus

Page 15: Use of data

Nothing is ever “same-as”

• We may need more meaningful predicates
• Or learn to use them better
• We need lenses, context matters
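One way to read "lenses" is as a context-dependent choice of which mapping predicates count as equivalent. The sketch below assumes that reading. The identifiers, the lens names and the function `equivalents` are hypothetical; the SKOS mapping predicates are used only as familiar examples of predicates that say more than "same-as".

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", ["source", "predicate", "target"])

# Hypothetical mappings between made-up identifiers.
MAPPINGS = [
    Mapping("ont1:liver", "skos:exactMatch", "ont2:hepar"),
    Mapping("ont1:liver", "skos:closeMatch", "ont3:hepatic_tissue"),
    Mapping("ont1:liver", "skos:broadMatch", "ont3:digestive_organ"),
]

# A lens (assumed definition) is the set of predicates accepted in a context.
LENSES = {
    "strict": {"skos:exactMatch"},
    "relaxed": {"skos:exactMatch", "skos:closeMatch"},
}


def equivalents(identifier, lens):
    """Return targets linked to `identifier` by a predicate the lens accepts."""
    accepted = LENSES[lens]
    return sorted(
        m.target
        for m in MAPPINGS
        if m.source == identifier and m.predicate in accepted
    )


if __name__ == "__main__":
    print("strict: ", equivalents("ont1:liver", "strict"))
    print("relaxed:", equivalents("ont1:liver", "relaxed"))
```

The same question gives different answers under different lenses, which is the point: what counts as "the same" depends on the context of the question.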

Page 16: Use of data

Too many standards

Source XKCD: https://xkcd.com/927/

Page 17: Use of data

Too many standards

And ontologies…

But they are there for a reason! Research fields have different focus/needs

Don’t standardise, map!

Page 18: Use of data

We need mapping

• Ontology mapping
• Identifier mapping
• Identity (text) mapping
• Chemistry mapping

Page 19: Use of data

We need mapping

• Ontology mapping: NCBO
• Identifier mapping: BridgeDb, IMS
• Identity (text) mapping: ConceptWiki?
• Chemistry mapping: CRS??
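To make "identifier mapping" concrete, here is a small sketch of the general idea: cross-references between databases form a graph, and mapping an identifier means collecting everything reachable from it in the target database. This is not the BridgeDb or IMS API; the database prefixes, identifiers and function names are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical cross-references between three made-up databases.
LINKS = [
    ("DbA:001", "DbB:X17"),
    ("DbB:X17", "DbC:g42"),
    ("DbA:002", "DbC:g99"),
]


def build_graph(links):
    """Build an undirected graph from pairwise cross-references."""
    graph = defaultdict(set)
    for left, right in links:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def map_identifier(graph, identifier, target_db):
    """Follow cross-references breadth-first and return hits in target_db."""
    seen = {identifier}
    queue = deque([identifier])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    prefix = target_db + ":"
    return sorted(i for i in seen if i.startswith(prefix) and i != identifier)


if __name__ == "__main__":
    graph = build_graph(LINKS)
    print(map_identifier(graph, "DbA:001", "DbC"))
```

Following links transitively like this treats every cross-reference as "same-as", which is exactly the assumption questioned on Page 15, so a real service has to decide which links may be chained.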

Page 20: Use of data

There is a lot out there

Page 21: Use of data

Discussed last Friday:
Serum and adipose tissue amino acid homeostasis in the MHO (Badoud 2014)

– Objective: Integrate metabolite and gene expression profiling to elucidate the molecular distinctions between Metabolically Healthy Obese (MHO) and Metabolically Unhealthy Obese (MUO)
• Conclusion: SAT gene expression profiling revealed that genes related to branched-chain amino acid catabolism and the tricarboxylic acid cycle were less down-regulated in MHO individuals compared to MUO individuals. Together, this integrated analysis revealed that MHO individuals have an intermediate amino acid homeostasis compared to LH and MUO individuals.
– (Diabetes Risk Assessment study) 3 groups: Lean Healthy (LH), MHO and MUO
• Fasting serum samples from all participants and adipose tissue from the periumbilical region under local anesthesia after an overnight fast
– Initially 30 participants, 10 in each group (7 women, 3 men), but for the microarray analysis they analyzed SAT from 7 LH, 8 MHO and 8 MUO, each group having 2 men. Not very clear why -> They selected samples having RNA integrity number higher than 8
– Gene expression data only for the 23 participants
– No gender or biological information (e.g. glucose, total triglycerides, etc.)
– No initial serum metabolite concentrations (only mean)
– dx.doi.org/10.1021/pr500416v
– Data can be found: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55200


Page 23: Use of data

Adding phenotypic data

Diversity, not size, makes big data hard

SAM module
– small assays
– diverse assays

For now annotation, used after you find it

Page 24: Use of data

Repositories are technology driven

• Expression data
• Protein data
• Metabolomics data
• Genetic variation data

Page 25: Use of data

Repositories are technology driven

• Expression data: ArrayExpress, GEO
• Protein data: PRIDE
• Metabolomics data: MetaboLights
• Genetic variation data: dbSNP

Page 26: Use of data

Start with the samples?

Page 27: Use of data

Or the studies?

ISA-Tab inspired: investigations link to studies, which link to assays, samples and the actual data

Study capturing…
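The hierarchy named on this slide can be sketched as nested records. This is one possible reading of the slide and not the ISA-Tab file format itself: the class names, field names and the example study are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    name: str


@dataclass
class Assay:
    technology: str
    samples: List[Sample] = field(default_factory=list)
    data_files: List[str] = field(default_factory=list)  # the actual data


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def data_files(investigation):
    """Walk investigation -> studies -> assays and collect all data files."""
    return [
        path
        for study in investigation.studies
        for assay in study.assays
        for path in assay.data_files
    ]


if __name__ == "__main__":
    # Hypothetical example content.
    assay = Assay("microarray", [Sample("s1"), Sample("s2")], ["s1.cel", "s2.cel"])
    investigation = Investigation("demo", [Study("demo study", [assay])])
    print(data_files(investigation))
```

Because every data file is reached through its study and assay, the study description travels with the data instead of staying hidden in a separate system.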

Page 28: Use of data

Capturing needs meta-ontologies
Examples: EFO (Experimental Factor Ontology), eNanoMapper (nanomaterials)

• Combine
• Map
• Slim
• Extend
• Feed extensions back to source
• Reproduce from (extended) source

Page 29: Use of data

If you can find it in a database

Can you find the database?

Discoverable fairports?

What about institute repos?

Page 30: Use of data

If study in dbNP

• Large data in repos (e.g. MetaboLights)
• Study descriptions still hidden

Page 31: Use of data

Combine with knowledge

• Can you find a study by the results?
• Integrate results (pathway and ontology profiles)

Page 32: Use of data

Challenges needed

Page 33: Use of data

Teams answering real questions

• Finds needs and solutions
• Combines across communities
• Fun! And inspiring
• Interesting, publishable results

Page 34: Use of data

Starting a database is easy

• What about sustainability?
• Core resources need:
– Long-term funding
– Regular monitoring
• Integration in communities

Page 35: Use of data