Leonore Reiser and Lisa Harper
Plantae Webinar, May 30, 2018
What does it mean to be FAIR? Why is it so
important?
How to make your published work more
FAIR
Planning your data management strategy
Stay to the end to complete a survey about
future webinars
Community databases/knowledgebases
Researchers
Data type specific repositories
What’s in it for YOU?
We all benefit from data sharing.
More citations of YOUR work, increasing
your visibility in the research community.
Easily comply with journal and
funding requirements
Less time spent fulfilling requests for data.
ACGATTGAAGAGAGACTTAAAGTGGTGGAATAAGCACATTTTTGGAGATATTTTTAAAATCCTCCGATTG
GCAGAAGTTGAAGCTGAACAAAGAGAATTGAATTTCCAACAGAATCCCTCAGCAGCTAATAGAGAATTGA
TGCATAAGGCTTATGCCAAACTTAACCGGCAGTTAAGTATTGAAGAACTTTTTTGGCAACAAAAGTCGGG
TGTCAAATGGTTAGTGGAGGGGGAACGCAACACCAAATTTTTTCATATGAGGATGCGTAAAAAAAGAATG
AGAAATCACATCTTCCGGATTCAGGATCAGGAAGGGAATGTGCTTGAAGAACCTCATTTAATCCAAAACT
CGGGTGTTGAATTCTTTCAAAACTTGCTGAAGGCAGAACAATGTGACATCTCCAGGTTTGATCCTTCTAT
TACTCCACGAATTATCTCCACCACTGATAATGAATTCTTGTGTGCAACCCCATCGTTACAGGAAGTGAAA
GAGGCAGTATTTAACATTAATAAAGATAGTGTCGCTGGGCCTGACGGTTTCTCATCCTTGTTTTACCAAC
ACTGCTGGGACATAATCAAGCAAGACCTTTTTGAAGCAGTGCTTGATTTTTTCAAGGGGAGCCCGCTACC
ACGTGGCATTACCTCCACAACGCTTGTCTTGTTACCTAAAACTCAGAATGTCAGCCAATGGAGTGAATTT
CGGCCCATTAGTTTATGCACTGTCTTAAACAAGATAGTAACTAAACTTTTGGCCAACCGGCTATCCAAAA
TTCTCCCATCCATCATCTCAGAAAACCAAAGTGGCTTCGTTAATGGAAGGCTTATAAGTGACAATATCTT
GCTTGCACAGGAGCTGGTTGATAAGATTAATGCAAGATCAAGGGGAGGTAATGTGGTCCTAAAACTTGAT
ATGGCAAAAGCTTATGACCGTCTGAATTGGGAATTTCTTTATCTTATGATGGAGCAGTTTGGTTTTAATG
CACTTTGGATAAACATGATTAAGGCCTGCATCTCCAACTGTTGGTTTTCATTACTCATCAATGGATCCTT
AGTGGGCTATTTCAAATCCGAGAGGGGACTGAGACAGGGCGATTCTATTTCCCCTTCGCTTTTTATCTTG
GCTGCAGAATATTTATCAAGGGGACTCAATCAGTTATTCAGCCGCTACAATTCTTTACATTACTTATCTG
GATGTTCCATGTCTGTGAGTCACCTTGCTTTTGCCGATGATATTGTAATTTTTACTAATGGTTGCCACTC
AGCCTTGCAGAAGATCTTGGTCTTCTTACAGGAATATGAACAGGTATCGGGGCAACAGGTTAATCATCAA
What types of Data are we talking about?
• We used to publish all the data we needed to prove a hypothesis within a publication
• But things have changed:
• Some data is now too large for inclusion in a publication
• Data can now be computationally analyzed, so it must be machine readable
Zhang et al, Plant Cell, 2018. doi.org/10.1105/tpc.17.00791
Photos of specimens
Data that can be included in Publications
Data OK in primary publication
Data goes in appropriate, stable, long term repository
Guo et al, Plant Cell, 2018. doi.org/10.1105/tpc.17.00842
Gel images, charts and graphs
Caseys et al, Plant Cell 2018. doi.org/10.1105/tpc.18.00278
Model cartoons
Kumar, et al, Plant Physiology 2018, doi.org/10.1104/pp.18.00263
SHORT lists of
primers in text format
(not pdf)
Data OK in primary publication
Data goes in appropriate, stable, long term repository
Data that can be included in Publications
Data OK in primary publication
Data goes in appropriate, stable, long term repository
• Genome Assemblies
• RNAseq/ChIPseq/OtherSeq
• QTL data (bi-parental or GWAS)/ SNPs/INDELs
• Other Genome Diversity data
• Proteomics
• Metabolomics
• Ionomics
• Etc.
Data TOO BIG for Publications
These Data Types need ADDITIONAL
attention to be FAIR
Data TOO BIG for Publications
Two examples of big data in publications:
1. A paper reports a new genome sequence assembly:
The genome sequence MUST be made available and should
be submitted to GenBank Genome.
2. A paper uses RNAseq to show that expression of a gene
of interest is altered under a certain condition. Only a
subset of the data needs to be shown in the paper, but ALL
of the RNAseq data is valuable. Publish the paper AND publish
the RNAseq data! You get TWO publications instead of one!
Credit: Melissa Haendel
Wilkinson, et al., (2016) The FAIR Guiding Principles for scientific data management and stewardship
10.1038/sdata.2016.18. https://www.nature.com/articles/sdata201618
• Findable means data is human and machine readable
and attached to persistent identifiers
• Accessible means data can be found and retrieved by
humans and machines using standard formats
• Interoperable means data can be exchanged and used
between systems.
• Reusable means data can be used by others
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
CHROM  POS    REF  ALT  Line 1  Line 2
1      12345  A    C    A       A
3      67891  C    T    H       C
10     23456  G    T    T       U
CHROM  POS    REF  ALT  Line 1  Line 2
Gm01   12345  A    C    0/0     0/0
Gm03   67891  C    T    0/1     0/0
Gm10   23456  G    T    1/1     ./.
CHROM  POS    REF  ALT  Line 1  Line 2
Chr01  12345  A    C    AA      AA
Chr03  67891  C    T    C/T     CC
Chr10  23456  G    T    TT      NN
ALL MEAN THE SAME!
BUT ARE NOT THE SAME
Use Standard formats: SNP example
SNP (Single Nucleotide Polymorphism): a SNP record needs the reference
and alternate base, the chromosome and genome position, a reference to
the genome assembly used, and the genotypes of the lines tested.
VCF: Variant Call Format
Is the STANDARD
Use the File format
STANDARD
for your data type
DOI: 10.3389/fpls.2017.01812
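As a minimal sketch of why one standard helps, the ad hoc genotype calls in the tables above can be translated into VCF-style genotype (GT) strings. The function name `to_vcf_gt`, the reading of `H` as heterozygous, and the reading of `U`, `N` and `NN` as missing calls are assumptions made for illustration; a real conversion should follow the documented encoding of each source file.

```python
# Assumed codes for "no call" in the ad hoc tables.
MISSING = {"", "U", "N", "NN", ".", ".."}


def to_vcf_gt(call: str, ref: str, alt: str) -> str:
    """Translate one ad hoc genotype call into a VCF GT string (biallelic SNP)."""
    cleaned = call.strip().upper().replace("/", "")
    if cleaned in MISSING:
        return "./."
    if cleaned == "H":  # assumed shorthand for a heterozygous call
        return "0/1"
    if len(cleaned) == 1:  # a single base is read as homozygous
        cleaned = cleaned * 2
    allele_index = {ref.upper(): 0, alt.upper(): 1}
    if len(cleaned) != 2 or any(base not in allele_index for base in cleaned):
        raise ValueError(f"cannot interpret genotype {call!r} for {ref}>{alt}")
    indexes = sorted(allele_index[base] for base in cleaned)
    return "/".join(str(i) for i in indexes)


# Rows from the first table: (REF, ALT, call for Line 1, call for Line 2)
rows = [("A", "C", "A", "A"), ("C", "T", "H", "C"), ("G", "T", "T", "U")]
for ref, alt, line1, line2 in rows:
    print(ref, alt, to_vcf_gt(line1, ref, alt), to_vcf_gt(line2, ref, alt))
```

Every guess in this function (what `H` means, what `U` means) is a guess a reader would otherwise have to make by hand, which is the argument for publishing in VCF in the first place.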
Use Standard formats: Data in images is NOT accessible
Data in PDF (image) format
is not findable or
accessible.
Leave tabular data in tables
If you use EXCEL, look out for data corruption and hidden Microsoft characters that impede parsing
Ziemann et al., 2016
10.1186/s13059-016-1044-7
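A minimal sketch of a check for the kind of corruption described above: gene symbols that a spreadsheet has silently turned into dates or floating point numbers. The two patterns and the function name `suspicious_gene_names` are assumptions for illustration and will not catch every case; the example values are the kind of conversion reported for human gene symbols.

```python
import re

# Values that look like spreadsheet auto-conversions rather than gene symbols.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE
)
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)


def suspicious_gene_names(names):
    """Return the entries that match a date-like or scientific-notation pattern."""
    flagged = []
    for name in names:
        value = name.strip()
        if DATE_LIKE.match(value) or FLOAT_LIKE.match(value):
            flagged.append(value)
    return flagged


column = ["SEPT2", "2-Sep", "MARCH1", "1-Mar", "2.31E+13"]
print(suspicious_gene_names(column))
```

Running a check like this on a gene list before submission is cheaper than having readers discover the problem later.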
Use Standard formats: Beware of Excel
Fig. 1: Prevalence of gene name errors in Supplementary Excel files
Panels: percentage of papers with gene lists affected; increase in supplementary files with gene name errors per year
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
Metadata: Species = xxx
Germplasm = xxx
Field location = xxx
Environment = xxx
Measurement method = xxx
Phenotype (Data): Plant is 170cm tall
Metadata is data about the data, and allows understanding of the data
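One possible way to keep a measurement and its metadata together is a small structured record, sketched below. The field names mirror the slide; the class name `PhenotypeRecord`, the helper `missing_metadata` and all example values other than the 170 cm height are placeholders invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class PhenotypeRecord:
    species: str
    germplasm: str
    field_location: str
    environment: str
    measurement_method: str
    trait: str
    value: float
    unit: str


def missing_metadata(record):
    """List the fields that were left empty, so gaps are caught before submission."""
    return [f.name for f in fields(record) if getattr(record, f.name) in ("", None)]


record = PhenotypeRecord(
    species="Zea mays",                      # placeholder value
    germplasm="",                            # left empty on purpose
    field_location="example field station",  # placeholder value
    environment="",                          # left empty on purpose
    measurement_method="ruler, soil surface to top of plant",  # placeholder value
    trait="plant height",
    value=170.0,
    unit="cm",
)

print(json.dumps(asdict(record), indent=2))
print("Missing:", missing_metadata(record))
```

Without the metadata fields, "170 cm" is just a number; with them, someone else can decide whether it is comparable to their own data.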
Supply Complete Metadata
• Write your Materials and Methods as if you wanted someone else to be able to reproduce your work.
• Be accurate and complete about your bench and field work; include samples/stocks/lines used, accession numbers, sources of materials, exact measuring techniques etc.
• Be AS accurate and complete about your computational pipelines. Include the raw data files you created and their versions. If you use reference data (e.g., a sequence assembly), include the version number, download dates, and download source.
• Include names of software applications, versions, platforms and source. If you use CyVerse, use their metadata reporting tools.
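For pipelines written in Python, software names, versions and platform can be recorded automatically rather than reconstructed from memory. The sketch below uses only the standard library; the function name `describe_environment` and the package names passed to it are assumptions for illustration.

```python
import json
import platform
import sys
from datetime import date
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, platform and package versions for a methods section."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "recorded_on": date.today().isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Package names here are examples only; list whatever your pipeline imports.
report = describe_environment(["numpy", "a-package-that-does-not-exist"])
print(json.dumps(report, indent=2))
```

Saving this report next to the results gives a dated record that can be pasted into Materials and Methods or deposited with the data.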
Supply Complete Metadata
Supply Complete Metadata: Example
Pretty Good
Pretty good, but lots of metadata in free text
Supply Complete Metadata
Not so good
But NONE of those are really great…
597 Possible Attributes
At least 50 Attributes
At least 100 Attributes
Budget TIME
to provide Metadata
The metadata in public databases is often confusing
and very incomplete
A test case with Zea mays RNAseq data reveals a high proportion
of missing, misleading or incomplete metadata.
Bhandary et al., Plant Science 2018. Raising orphans from a metadata morass: A researcher's
guide to re-use of public ’omics data. https://doi.org/10.1016/j.plantsci.2017.10.014
• Established: Genomic Standards Consortium (http://gensc.org)
• Minimum Information about any (x) Sequence (MIxS)
• Emerging:
• Minimum Information About a Plant Phenotyping Experiment (MIAPPE)
Metadata Standards for Various Data Types
Ask For Help from Database People
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
Cell
Same word,
different meanings
Different words, same concept
Eggplant
Aubergine
Melongene
Credit: Monica Munoz Torres
An Ontology is:
A set of precisely defined terms
in a logical hierarchy, and the
relationship between terms can be
understood by computers
PO:0020105 ligule
Ontologies: Hierarchy of terms and
explicit relationship among terms
Plant Ontology (PO)
Ligule (PO:0020105)
Vascular leaf (PO:0009025)
Leaf sheath (PO:0020104)
Flag leaf (PO:0020103)
Adult vascular leaf (PO:0020103)
Leaf (PO:0025034)
Embracing ontologies
• Ontologies provide a POWERFUL, MACHINE READABLE utility to ensure we are all speaking the same language
• Examples of ontologies:
• Gene Function = Gene Ontology (GO)
• Plant Anatomy and Development = Plant Ontology (PO)
• Phenotypes = Phenotype and Trait Ontology (PATO)
• …..many many others
• Find and use existing ontologies:
• http://bioportal.bioontology.org/ (711 ontologies)
• https://www.ebi.ac.uk/ols/index (208 ontologies)
• Planteome (http://planteome.org)
• Ask Questions!
Using Ontologies in Metadata
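A minimal sketch of attaching ontology terms to metadata: each free-text label is replaced by a term identifier plus a resolvable link. The labels and identifiers are the Plant Ontology terms shown earlier in this talk; the function names and the assumption that identifiers have a prefix followed by seven digits are for illustration only.

```python
import re

# Labels and identifiers as shown on the Plant Ontology slide.
PO_TERMS = {
    "ligule": "PO:0020105",
    "vascular leaf": "PO:0009025",
    "leaf sheath": "PO:0020104",
    "flag leaf": "PO:0020103",
    "leaf": "PO:0025034",
}

# Assumed identifier shape: prefix, colon, seven digits.
CURIE = re.compile(r"^([A-Za-z]+):(\d{7})$")


def term_iri(curie):
    """Turn an identifier such as PO:0020105 into its OBO permanent URL."""
    match = CURIE.match(curie)
    if match is None:
        raise ValueError(f"not a recognised term identifier: {curie!r}")
    prefix, local_id = match.groups()
    return f"http://purl.obolibrary.org/obo/{prefix}_{local_id}"


def annotate(label):
    """Replace a free-text label with label, identifier and link."""
    key = label.strip().lower()
    curie = PO_TERMS[key]
    return {"label": key, "id": curie, "iri": term_iri(curie)}


print(annotate("Flag leaf"))
```

Storing the identifier alongside the label means that "Flag leaf", "flag leaf" and a translated label all resolve to the same machine-readable term.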
Questions?
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
Use persistent, unambiguous identifiers
Example: Gene names
GOOD!
Identifiers also resolve confusion over species
Is this Arabidopsis? Maize? Tomato?
DOI: 10.1104/pp.17.00021
One gene- many names
GOOD
OK
(history)
One name- many genes
A ‘gene’ may have many named sequences
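The two problems above (one gene with many names, one name used for many genes) can be detected mechanically once names are tied to identifiers. In this sketch every gene name and identifier is hypothetical and invented for illustration; a real check would use the nomenclature resource for your species.

```python
from collections import defaultdict

# Hypothetical (name, identifier) pairs; none of these are real genes.
PAIRS = [
    ("abc1", "GENE_0001"),
    ("yfg2", "GENE_0001"),  # second name for the same gene
    ("abc1", "GENE_0042"),  # same name reused for a different gene
]


def ids_by_name(pairs):
    """Index identifiers by (case-insensitive) gene name."""
    index = defaultdict(set)
    for name, identifier in pairs:
        index[name.lower()].add(identifier)
    return index


def ambiguous_names(pairs):
    """Names that point at more than one identifier."""
    return {name: ids for name, ids in ids_by_name(pairs).items() if len(ids) > 1}


def resolve(name, pairs):
    """Return the single identifier for a name, or fail loudly if it is unclear."""
    ids = ids_by_name(pairs).get(name.lower(), set())
    if len(ids) != 1:
        raise LookupError(f"{name!r} maps to {len(ids)} identifiers; cite the identifier")
    return next(iter(ids))


print(ambiguous_names(PAIRS))
print(resolve("yfg2", PAIRS))
```

A name that fails `resolve` is exactly the case where a paper should give the persistent identifier and not only the name.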
Community Standards and Nomenclature Resources
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
Put your data in a stable public repository
Large International Repositories for many data
types for all species. ALL sequence data goes here
Large but specialized databases serving many species
Soybase
Specialized databases serving specific communities
Which repository?
ALL types of sequence Data: NCBI, DDBJ, or
EMBL
SNP data: European Variation Archive (EVA,
https://www.ebi.ac.uk/eva/).
(NCBI’s dbSNP will only process Human SNPs)
RNAseq: GEO (https://www.ncbi.nlm.nih.gov/geo/) or
ArrayExpress (https://www.ebi.ac.uk/arrayexpress/)
Re3data- searching repositories
https://www.re3data.org/
FAIRsharing- searching repositories
https://fairsharing.org/
FAIRsharing- searching metadata standards
https://fairsharing.org/
What if there is no specialized database? Or no recommendations from journals?
You should get a Digital Object Identifier (DOI)
http://datadryad.org
** Curated, metadata
https://zenodo.org/
https://figshare.com/
https://datashare.ucsf.edu/stash
And institutional repositories
But.. please, don’t forget to actually complete your submission*...
*And you never have to spend time fielding requests
or transferring huge data files again
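Once a repository has issued a DOI, cite it in a form that resolves. The sketch below pulls a DOI out of the loosely formatted strings that tend to appear in papers and slides and returns a link through the doi.org resolver. The regular expression is a simplification of DOI syntax and the function name `doi_url` is an assumption for illustration; the example strings are DOIs cited in this talk.

```python
import re

# Simplified DOI shape: "10.", a registrant code of 4 to 9 digits, "/", a suffix.
DOI_PATTERN = re.compile(r"(10\.\d{4,9}/\S+)")


def doi_url(text):
    """Extract a DOI from free text and return a resolvable https://doi.org link."""
    match = DOI_PATTERN.search(text.strip())
    if match is None:
        raise ValueError(f"no DOI found in {text!r}")
    doi = match.group(1).rstrip(".,;")  # drop trailing sentence punctuation
    return "https://doi.org/" + doi


for citation in [
    "doi.org/10.1105/tpc.17.00791",
    "10.1186/s13059-016-1044-7",
    "https://doi.org/10.1016/j.plantsci.2017.10.014",
]:
    print(doi_url(citation))
```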
Data Management Planning
What external sources of data will I be using?
Where will it come from?
Are there restrictions on reuse?
What types of data will I be generating?
What sort of metadata do I need to collect?
How will I structure and store my data?
Are there existing data handling standards and tools?
Where will the data reside when my project is done?
Is there a repository that can handle my data?
What metadata and files do I need to provide and how?
If I plan to host the data myself on a website/database, how
will I maintain it, and for how long? What happens then?
Under what terms will others be able to reuse my data?
If I want to, how will I be able to track how my data is
reused?
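One possible answer to "what files do I need to provide and how" is a manifest that lists every file with its size and checksum, so a repository or a reader can verify that nothing was lost or altered. This is a sketch under assumptions: the function names, the manifest fields and the example file are invented for illustration, and individual repositories may ask for a different format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """List every file under data_dir with its relative path, size and checksum."""
    root = Path(data_dir)
    return [
        {
            "file": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": file_sha256(path),
        }
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


# Demonstration on a throwaway directory with one small example file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "heights.tsv").write_text("line\theight_cm\nLine1\t170\n")
    print(json.dumps(build_manifest(tmp), indent=2))
```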
How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
Cite, share freely and encourage others to be FAIR
Include searchable and citable identifiers for your data in your papers
Release your data with clearly defined terms of use
e.g. Creative Commons (CC) CC-0, CC-BY
(https://creativecommons.org)
If you do not specify terms of use, restrictions may be implied, limiting reuse
Cite all of your data sources
Enhances reproducibility….. and also shows value to funders!
When reviewing papers check them for FAIRness
Good data practices benefit everyone (and help you get funded)
NSF considers the Data Management Plan (DMP) to be an integral part of all full proposals
(http://www.nsf.gov/bfa/dias/policy/dmp.jsp), that will be “considered under Intellectual Merit or
Broader Impacts or both, as appropriate for the scientific community of relevance” (PAPPG, pg. II-212).
BIO recognizes that different research communities may have their own data management practices
and standards; that these norms will change over time; and, that lifecycles of usefulness will vary for
different data types. As such, it is essential for scientific communities to guide needed standards
development, and to shape expectations for sharing or archiving.
https://www.nsf.gov/bio/pubs/BIODMP102015.pdf
Thank you!
Please complete the survey
AgBioData