Keynote presented at the Phenotype Foundation first annual meeting.
Amsterdam, January 18, 2016
Prof. Chris Evelo
Department Bioinformatics – BiGCaT
Maastricht University
@Chris_Evelo
The use and needs of data sharing in biology
Data
• Things we know
• Things we measure
Knowledge is hard to get
And it doesn’t even play it…
But you can gamify collection
Since we structure it, it can be easier to store
Sharing Data
I would like to exploit common genotype-phenotype relations between Alzheimer’s Disease and Huntington’s Disease…
I need to combine AD and HD data…
I can help with that!
I can help with that!
Source: Marco Roos
Who wants to share data?
• People who want to use data
• Funders
• Publishers
• But the researchers?
You only need MS-Excel
People hide data
• I did all this work, I want to reuse it myself
• They don’t need this part, it might be my next…
• I might get a patent on this
• Or… it needs a patent to be valuable
• I can’t even patent because ...
How?
• Don’t add specifics (ohh, those really were knockout cells, but..)
• Leave out important steps (I did these PCRs, why show the array)
• And “we used an approach slightly modified from…”
• ...
FAIR data
• Findable
• Accessible
• Interoperable
• Reusable
Sharing Data
Source: Marco Roos
???
Here’s my data, have fun!
Here’s my data, have fun!
Sharing Linkable Data
Source: Marco Roos
I can go straight to answering my questions with data from multiple data owners!
Patients will be so pleased with this speed-up!
Here’s my Linked Data, have fun!
Here’s my Linked Data, have fun!
Really?
From terms “liver, hepar, hepatic tissue”
To URIs:
http://identifiers.org/tissueont1/liver
http://identifiers.org/tissueont2/hepar
….
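A minimal sketch of this first step, assuming a hand-made lookup table (the name `TERM_TO_URI` and the function `term_to_uri` are hypothetical). Only the two URIs shown on the slide are used; "hepatic tissue" stays unmapped because the slide does not give a URI for it.

```python
# Hypothetical lookup table: free-text tissue terms -> URIs taken from the slide.
TERM_TO_URI = {
    "liver": "http://identifiers.org/tissueont1/liver",
    "hepar": "http://identifiers.org/tissueont2/hepar",
}


def term_to_uri(term):
    """Return the URI for a free-text term, or None when no mapping is known."""
    return TERM_TO_URI.get(term.strip().lower())


if __name__ == "__main__":
    for term in ["Liver", "hepar", "hepatic tissue"]:
        print(term, "->", term_to_uri(term))
```

Note that the two mapped terms still end up with different URIs, so replacing words by identifiers does not by itself say that they mean the same thing.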
Just a first step
And we didn’t even get that…
Reality:
Ontology-inspired pull-down menus
Nothing is ever “same-as”
• We may need more meaningful predicates
• Or learn to use them better
• We need lenses, context matters
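The lens idea can be sketched as follows, assuming mappings are stored as (source, predicate, target) records and a lens is just the set of predicates a given context accepts as "equivalent". The SKOS predicates `skos:exactMatch` and `skos:closeMatch` are real vocabulary terms; the mapping records, the shortened identifiers, and the lens names `strict` and `relaxed` are illustrative assumptions, not curated data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str
    predicate: str
    target: str


# Illustrative mappings only; which predicate fits is exactly the judgment call.
MAPPINGS = [
    Mapping("tissueont1/liver", "skos:exactMatch", "tissueont2/hepar"),
    Mapping("tissueont1/liver", "skos:closeMatch", "example/hepatic_tissue"),
]

# A lens lists the predicates that count as "same enough" in one context.
LENSES = {
    "strict": {"skos:exactMatch"},
    "relaxed": {"skos:exactMatch", "skos:closeMatch"},
}


def equivalents(uri, lens):
    """Return identifiers linked to `uri` by a predicate the lens accepts."""
    accepted = LENSES[lens]
    found = set()
    for m in MAPPINGS:
        if m.predicate not in accepted:
            continue
        if m.source == uri:
            found.add(m.target)
        elif m.target == uri:
            found.add(m.source)
    return found
```

With the `strict` lens only the exact match is returned; the `relaxed` lens also returns the close match, so the same data answers differently depending on context.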
Too many standards
Source XKCD: https://xkcd.com/927/
Too many standards
And ontologies…
But they are there for a reason! Research fields have different focus/needs
Don’t standardise, map!
We need mapping
• Ontology mapping
• Identifier mapping
• Identity (text) mapping
• Chemistry mapping
We need mapping
• Ontology mapping: NCBO
• Identifier mapping: BridgeDb, IMS
• Identity (text) mapping: Conceptwiki?
• Chemistry mapping: CRS??
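To illustrate "map, don't standardise" for identifiers: a minimal sketch, assuming two datasets keyed by different identifier systems and a small cross-reference table between them. The identifiers (`sysA:…`, `sysB:…`), the table `ID_MAP` and the function `combine` are all hypothetical placeholders; this does not use the API of BridgeDb or any other mapping service.

```python
# Hypothetical cross-reference table between two identifier systems.
ID_MAP = {
    "sysA:0001": "sysB:g1",
    "sysA:0002": "sysB:g2",
}


def combine(values_by_a, annotation_by_b, id_map):
    """Join two datasets keyed by different identifier systems.

    Neither dataset is rewritten; the mapping table is applied at query time.
    Entries without a mapping or without a partner record are skipped.
    """
    combined = {}
    for a_id, value in values_by_a.items():
        b_id = id_map.get(a_id)
        if b_id is None or b_id not in annotation_by_b:
            continue
        combined[a_id] = (value, annotation_by_b[b_id])
    return combined


if __name__ == "__main__":
    expression = {"sysA:0001": 2.5, "sysA:0002": -1.0, "sysA:0003": 0.3}
    annotation = {"sysB:g1": "pathway X", "sysB:g2": "pathway Y"}
    print(combine(expression, annotation, ID_MAP))
```

Each source keeps its own identifiers, and only the mapping table has to be maintained when a new source is added.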
There is a lot out there
Discussed last Friday: Serum and adipose tissue amino acid homeostasis in the MHO (Badoud 2014)
– Objective: Integrate metabolite and gene expression profiling to elucidate the molecular distinctions between Metabolically Healthy Obese (MHO) and Metabolically Unhealthy Obese (MUO)
• Conclusion: SAT gene expression profiling revealed that genes related to branched-chain amino acid catabolism and the tricarboxylic acid cycle were less down-regulated in MHO individuals compared to MUO individuals. Together, this integrated analysis revealed that MHO individuals have an intermediate amino acid homeostasis compared to LH and MUO individuals.
– (Diabetes Risk Assessment study) 3 groups: Lean Healthy (LH), MHO and MUO
• Fasting serum samples from all participants and adipose tissue from the periumbilical region under local anesthesia after an overnight fast
– Initially 30 participants, 10 in each group (7 women, 3 men), but for the microarray analysis they analyzed SAT from 7 LH, 8 MHO and 8 MUO, each group having 2 men. Not very clear why -> they selected samples with an RNA integrity number higher than 8
– Gene expression data only for the 23 participants
– No gender or biological information (e.g. glucose, total triglycerides, etc.)
– No initial serum metabolite concentrations (only means)
– dx.doi.org/10.1021/pr500416v
– Data can be found: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55200
Adding phenotypic data
Diversity, not size, makes big data hard
SAM module
- small assays
- diverse assays
For now annotation, used after you find it
Repositories are technology driven
• Expression data
• Protein data
• Metabolomics data
• Genetic variation data
Repositories are technology driven
• Expression data: ArrayExpress, GEO
• Protein data: PRIDE
• Metabolomics data: MetaboLights
• Genetic variation data: dbSNP
Start with the samples?
Or the studies?
ISA-tab inspired:
investigations link to studies,
which link to assays,
samples
and the actual data
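The hierarchy on this slide can be sketched as nested records. This is a simplified illustration, not the ISA-Tab specification: the class and field names (`Investigation`, `Study`, `Assay`, `sample_names`, `data_file`) and the helper `data_files_for_sample` are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    technology: str          # e.g. "microarray" or "metabolomics"
    sample_names: List[str]  # samples measured in this assay
    data_file: str           # pointer to the actual data


@dataclass
class Study:
    title: str
    samples: List[str] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)


def data_files_for_sample(investigation, sample):
    """Walk investigation -> studies -> assays and collect data files for one sample."""
    return [
        assay.data_file
        for study in investigation.studies
        for assay in study.assays
        if sample in assay.sample_names
    ]
```

Starting from the sample, one walk through the hierarchy finds every data file that describes it, regardless of which technology-driven repository holds the file.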
Study capturing…
Capturing needs meta-ontologies
Examples: EFO (Experimental Factor Ontology), eNanoMapper (nanomaterials)
• Combine
• Map
• Slim
• Extend
• Feed extensions back to source
• Reproduce from (extended) source
If you can find it in a database
Can you find the database?
Discoverable fairports?
What about institute repos?
If study in dbNP
• Large data in repos (e.g. MetaboLights)
• Study descriptions still hidden
Combine with knowledge
• Can you find a study by the results?
• Integrate results (pathway and ontology profiles)
Challenges needed
Teams answering real questions
• Finds needs and solutions
• Combines across communities
• Fun! And inspiring
• Interesting, publishable results
Starting a database is easy
• What about sustainability?
• Core resources need:
– Long-term funding
– Regular monitoring
• Integration in communities