
Leonore Reiser and Lisa Harper

Plantae Webinar, May 30, 2018

What does it mean to be FAIR? Why is it so important?

How to make your published work more FAIR

Planning your data management strategy

Stay to the end to complete a survey about future webinars

Community databases/knowledgebases

Researchers

Data type specific repositories

http://www.gramene.org/

https://www.rosaceae.org/

http://www.geneontology.org/page/go-citation-policy

What’s in it for YOU?

We all benefit from data sharing.

More citations of YOUR work, increasing your visibility in the research community.

Easily comply with journal and funding requirements

Less time spent fulfilling requests for data.

ACGATTGAAGAGAGACTTAAAGTGGTGGAATAAGCACATTTTTGGAGATATTTTTAAAATCCTCCGATTG

GCAGAAGTTGAAGCTGAACAAAGAGAATTGAATTTCCAACAGAATCCCTCAGCAGCTAATAGAGAATTGA

TGCATAAGGCTTATGCCAAACTTAACCGGCAGTTAAGTATTGAAGAACTTTTTTGGCAACAAAAGTCGGG

TGTCAAATGGTTAGTGGAGGGGGAACGCAACACCAAATTTTTTCATATGAGGATGCGTAAAAAAAGAATG

AGAAATCACATCTTCCGGATTCAGGATCAGGAAGGGAATGTGCTTGAAGAACCTCATTTAATCCAAAACT

CGGGTGTTGAATTCTTTCAAAACTTGCTGAAGGCAGAACAATGTGACATCTCCAGGTTTGATCCTTCTAT

TACTCCACGAATTATCTCCACCACTGATAATGAATTCTTGTGTGCAACCCCATCGTTACAGGAAGTGAAA

GAGGCAGTATTTAACATTAATAAAGATAGTGTCGCTGGGCCTGACGGTTTCTCATCCTTGTTTTACCAAC

ACTGCTGGGACATAATCAAGCAAGACCTTTTTGAAGCAGTGCTTGATTTTTTCAAGGGGAGCCCGCTACC

ACGTGGCATTACCTCCACAACGCTTGTCTTGTTACCTAAAACTCAGAATGTCAGCCAATGGAGTGAATTT

CGGCCCATTAGTTTATGCACTGTCTTAAACAAGATAGTAACTAAACTTTTGGCCAACCGGCTATCCAAAA

TTCTCCCATCCATCATCTCAGAAAACCAAAGTGGCTTCGTTAATGGAAGGCTTATAAGTGACAATATCTT

GCTTGCACAGGAGCTGGTTGATAAGATTAATGCAAGATCAAGGGGAGGTAATGTGGTCCTAAAACTTGAT

ATGGCAAAAGCTTATGACCGTCTGAATTGGGAATTTCTTTATCTTATGATGGAGCAGTTTGGTTTTAATG

CACTTTGGATAAACATGATTAAGGCCTGCATCTCCAACTGTTGGTTTTCATTACTCATCAATGGATCCTT

AGTGGGCTATTTCAAATCCGAGAGGGGACTGAGACAGGGCGATTCTATTTCCCCTTCGCTTTTTATCTTG

GCTGCAGAATATTTATCAAGGGGACTCAATCAGTTATTCAGCCGCTACAATTCTTTACATTACTTATCTG

GATGTTCCATGTCTGTGAGTCACCTTGCTTTTGCCGATGATATTGTAATTTTTACTAATGGTTGCCACTC

AGCCTTGCAGAAGATCTTGGTCTTCTTACAGGAATATGAACAGGTATCGGGGCAACAGGTTAATCATCAA

What types of Data are we talking about?

• We used to publish all the data we needed to prove a hypothesis within a publication

• But things have changed:

• Some data is now too large for inclusion in a publication

• Data can now be computationally analyzed, so it must be machine readable

Zhang et al, Plant Cell, 2018. doi.org/10.1105/tpc.17.00791

Photos of specimens

Data that can be included in Publications

Data OK in primary publication

Data goes in appropriate, stable, long term repository

Guo et al, Plant Cell, 2018. doi.org/10.1105/tpc.17.00842

Gel images, charts and graphs

Caseys et al, Plant Cell 2018. doi.org/10.1105/tpc.18.00278

Model cartoons

Kumar, et al, Plant Physiology 2018, doi.org/10.1104/pp.18.00263

SHORT lists of primers in text format (not pdf)






• Genome Assemblies

• RNAseq/ChIPseq/OtherSeq

• QTL data (bi-parental or GWAS)/ SNPs/INDELs

• Other Genome Diversity data

• Proteomics

• Metabolomics

• Ionomics

• Etc.

Data TOO BIG for Publications

These Data Types need ADDITIONAL attention to be FAIR

Data TOO BIG for Publications

Two Examples of Big Data in publications:

1. Paper reports on a new genome sequence assembly: That genome sequence MUST be made available; it should be submitted to GenBank Genome.

2. Paper used RNAseq to show that expression of their gene of interest is altered under a certain condition. Only a subset needs to be shown, but ALL the RNAseq data is valuable. Publish the paper AND publish the RNAseq Data! You get TWO publications instead of one!

Credit: Melissa Haendel

Wilkinson et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. doi:10.1038/sdata.2016.18. https://www.nature.com/articles/sdata201618

• Findable means data is human and machine readable and attached to persistent identifiers

• Accessible means data can be found and retrieved by humans and machines using standard formats

• Interoperable means data can be exchanged and used between systems.

• Reusable means data can be used by others

How to Make Your Published Data FAIR

• Use standard formats

• Supply complete metadata

• Embrace Ontologies

• Use persistent and unambiguous identifiers

• Put your data in a long term stable repository

• Cite, share freely and encourage others

CHROM  POS    REF  ALT  Line 1  Line 2
1      12345  A    C    A       A
3      67891  C    T    H       C
10     23456  G    T    T       U

CHROM  POS    REF  ALT  Line 1  Line 2
Gm01   12345  A    C    0/0     0/0
Gm03   67891  C    T    0/1     0/0
Gm10   23456  G    T    1/1     ./.

CHROM  POS    REF  ALT  Line 1  Line 2
Chr01  12345  A    C    AA      AA
Chr03  67891  C    T    C/T     CC
Chr10  23456  G    T    TT      NN

ALL MEAN THE SAME!

BUT ARE NOT THE SAME

Use Standard formats: SNP example

SNP (Single Nucleotide Polymorphism): A base, a chromosome number and genome position, a reference to the genome assembly used, and the genotypes of lines tested.

VCF (Variant Call Format) is the STANDARD

Use the STANDARD file format for your data type
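The three SNP tables above carry the same calls in three private encodings. A minimal Python sketch of how such calls might be normalized to VCF-style genotype (GT) strings follows. The code conventions it relies on ("H" for heterozygous, "U" or "NN" for missing, a single letter for a homozygous call) are assumptions read off the example tables, and the function names are hypothetical. A real submission also needs a complete VCF header and chromosome names that match the reference assembly.

```python
# Sketch only: the genotype code conventions below are assumptions taken
# from the example tables, not rules from the VCF specification.
MISSING = {"U", "N", "NN", "./.", "."}


def to_gt(call: str, ref: str, alt: str) -> str:
    """Translate one ad hoc genotype call into an unphased VCF GT string."""
    call = call.strip().upper()
    if not call or call in MISSING:
        return "./."
    if call == "H":  # assumed: H means heterozygous REF/ALT
        return "0/1"
    if set(call) <= set("01/|"):  # already written VCF-style
        return call
    bases = call.replace("/", "")
    if len(bases) == 1:  # assumed: a single letter means homozygous
        bases *= 2
    index = {ref.upper(): "0", alt.upper(): "1"}
    return "/".join(sorted(index[base] for base in bases))


def vcf_line(chrom: str, pos: int, ref: str, alt: str, calls: list[str]) -> str:
    """Build one tab-separated VCF data line with GT as the only FORMAT key."""
    gts = [to_gt(call, ref, alt) for call in calls]
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", ".", "GT", *gts])


if __name__ == "__main__":
    print("##fileformat=VCFv4.2")
    print("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tLine1\tLine2")
    print(vcf_line("Gm01", 12345, "A", "C", ["A", "A"]))
    print(vcf_line("Gm03", 67891, "C", "T", ["H", "C"]))
    print(vcf_line("Gm10", 23456, "G", "T", ["TT", "NN"]))
```

With these assumptions, the rows of all three tables reduce to the same GT values, which is the point of agreeing on one format.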

DOI: 10.3389/fpls.2017.01812

Use Standard formats: Data in images is NOT accessible

Data in PDF (image) format is not findable or accessible.

Leave tabular data in tables

https://doi.org/10.3389/fpls.2017.01812

If you use EXCEL, look out for data corruption and hidden Microsoft characters that impede parsing

Ziemann et al., 2016. doi:10.1186/s13059-016-1044-7

Use Standard formats: Beware of Excel

Fig. 1: Prevalence of gene name errors in Supplementary Excel files. Panels: percentage of papers with gene lists affected; increase in supplementary files with gene name errors per year.
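One way to catch this kind of corruption before submission is to scan exported gene lists for values that look like dates or scientific notation. The Python sketch below is a heuristic under assumptions: the two patterns and the function name are illustrative, and they cover only the conversions reported for symbols such as SEPT2 and MARCH1 and for numeric-looking identifiers.

```python
import re

# Heuristic patterns (assumptions, not an exhaustive list of Excel conversions).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE)


def suspicious_gene_names(names: list[str]) -> list[str]:
    """Return the entries that look like spreadsheet-mangled identifiers."""
    flagged = []
    for name in names:
        value = name.strip()
        if DATE_LIKE.match(value) or SCI_NOTATION.match(value):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    gene_list = ["SEPT2", "2-Sep", "MARCH1", "1-Mar", "2.31E+13", "AT1G01010"]
    for name in suspicious_gene_names(gene_list):
        print("check this entry:", name)
```

A flagged entry cannot be repaired automatically, because the original symbol is lost; the check only tells you to go back to the source file.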



• Supply complete metadata





Metadata:

Species = xxx

Germplasm = xxx

Field location = xxx

Environment = xxx

Measurement method = xxx

Phenotype (Data): Plant is 170 cm tall

Metadata is data about the data, and allows understanding of the data

Supply Complete Metadata

• Write your Materials and Methods as if you wanted someone else to be able to reproduce your work.

• Be accurate and complete about your bench and field work; include samples/stocks/lines used, accession numbers, sources of materials, exact measuring techniques etc.

• Be AS accurate and complete about your computational pipelines. Include your created raw data files and versions. If you use reference data (e.g., a sequence assembly), include the version number, download dates, and download source.

• Include names of software applications, versions, platforms and source. If you use CyVerse, use their metadata reporting tools.
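Metadata is easier to keep complete when it is stored as structured key-value pairs instead of free text. The Python sketch below assumes a hypothetical minimal field list taken from the plant height example earlier in this deck; the field names and the example values are illustrative and do not represent any community standard.

```python
import json

# Hypothetical minimal field list, taken from the plant height example.
REQUIRED_FIELDS = [
    "species",
    "germplasm",
    "field_location",
    "environment",
    "measurement_method",
]


def missing_fields(record: dict) -> list[str]:
    """List required metadata fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    # Example values are made up for illustration.
    record = {
        "species": "Zea mays",
        "germplasm": "B73",
        "field_location": "",
        "environment": "irrigated field plot",
        "measurement_method": "ruler, soil surface to flag leaf collar",
        "plant_height_cm": 170,
    }
    print("missing:", missing_fields(record))
    # A machine-readable copy that can travel with the data file.
    print(json.dumps(record, indent=2))
```

Running a check like this at data collection time is cheaper than reconstructing missing values when the paper is submitted.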


Supply Complete Metadata: Example

Pretty Good

Pretty Good, but lots of metadata in free text

Not so good

But NONE of those are really great…

597 Possible Attributes

At least 50 Attributes

At least 100 Attributes

Budget TIME to provide Metadata

The metadata in public databases is often confusing and very incomplete

A test case with Zea mays RNAseq data reveals a high proportion of missing, misleading or incomplete metadata.

Bhandary, et al, Plant Science 2018. Raising orphans from a metadata morass: A researcher's guide to re-use of public ’omics data. https://doi.org/10.1016/j.plantsci.2017.10.014

• Established: Genomic Standards Consortium (http://gensc.org)

    • Minimal Information about Any Sequence

• Emerging

    • Minimal Information about a Plant Phenotyping Experiment (MIAPPE)

Metadata Standards for Various Data Types

Ask For Help from Database People



• Supply complete and deep metadata





Cell

Same word, different meanings

Different words, same concept

Eggplant

Aubergine

Melongene

Credit: Monica Munoz Torres

An Ontology is:

A set of precisely defined terms in a logical hierarchy, and the relationship between terms can be understood by computers

PO:0020105 ligule

Ontologies: Hierarchy of terms and explicit relationship among terms

Plant Ontology (PO)

Ligule (PO:0020105)

Vascular leaf (PO:0009025)

Leaf sheath (PO:0020104)

Flag leaf (PO:0020103)

Adult vascular leaf (PO:0020103)

Leaf (PO:0025034)

Embracing ontologies

• Ontologies provide a POWERFUL, MACHINE READABLE utility to ensure we are all speaking the same language

• Examples of ontologies:

• Gene Function = Gene Ontology (GO)

• Plant Anatomy and Development = Plant Ontology (PO)

• Phenotypes = Phenotype and Trait Ontology (PATO)

• …..many many others

• Find and use existing ontologies:

    • http://bioportal.bioontology.org/ (711 ontologies)

    • https://www.ebi.ac.uk/ols/index (208 ontologies)

    • Planteome (http://planteome.org)

• Ask Questions!
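In practice, embracing an ontology means storing the term identifier next to the human-readable label, or instead of it. The Python sketch below uses the Plant Ontology identifiers shown on the earlier slide. The lookup table is a small hand-copied illustration, and the function names are hypothetical; a real workflow would query a service such as Planteome or OLS.

```python
import re

# Pattern for Plant Ontology identifiers as written on the slides (PO:NNNNNNN).
PO_ID = re.compile(r"^PO:\d{7}$")

# Small lookup copied from the slide; not a substitute for the full ontology.
PO_TERMS = {
    "ligule": "PO:0020105",
    "leaf sheath": "PO:0020104",
    "flag leaf": "PO:0020103",
    "vascular leaf": "PO:0009025",
    "leaf": "PO:0025034",
}


def is_po_id(text: str) -> bool:
    """Check that a string has the shape of a Plant Ontology identifier."""
    return bool(PO_ID.match(text.strip()))


def annotate(label: str) -> str:
    """Map a free-text tissue label to its PO identifier, or raise KeyError."""
    key = " ".join(label.lower().split())
    if key not in PO_TERMS:
        raise KeyError(f"no PO term recorded for {label!r}; look it up before submitting")
    return PO_TERMS[key]


if __name__ == "__main__":
    for label in ["Flag leaf", "  LIGULE ", "leaf   sheath"]:
        print(label.strip(), "->", annotate(label))
```

Failing loudly on an unknown label is a deliberate choice here: a missing term is a prompt to search the ontology, not something to fill in by guessing.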

http://planteome.org

Using Ontologies in Metadata

Questions?

Use persistent, unambiguous identifiers

Example: Gene names

GOOD!

Identifiers also resolve confusion over species

Is this Arabidopsis? Maize? Tomato?

DOI: 10.1104/pp.17.00021

One gene- many names

GOOD

OK

(history)

One name- many genes

A ‘gene’ may have many named sequences

Community Standards and Nomenclature Resources

Put your data in a stable public repository

Large International Repositories for many data types for all species. ALL sequence data goes here

Large but specialized databases serving many species

Soybase

Specialized databases serving specific communities

Which repository?

ALL types of sequence Data: NCBI, DDBJ, or EMBL

SNP data: European Variation Archive (EVA, https://www.ebi.ac.uk/eva/). (NCBI’s dbSNP will only process Human SNPs)

RNAseq: GEO (https://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (https://www.ebi.ac.uk/arrayexpress/)
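The recommendations above can be kept as a small lookup in a lab's data management notes. The Python sketch below encodes only the repositories named in this deck; the dictionary keys and the function name are hypothetical, and the fallback list is the set of general-purpose repositories named later in this deck.

```python
# Repositories as named in this deck; keys are hypothetical short labels.
REPOSITORIES = {
    "sequence": ["NCBI", "DDBJ", "EMBL"],
    "snp": ["European Variation Archive (EVA)"],
    "rnaseq": ["GEO", "ArrayExpress"],
}

# Fallback when no specialized repository is known for the data type.
GENERAL_REPOSITORIES = ["Dryad", "Zenodo", "figshare"]


def suggest_repository(data_type: str) -> list[str]:
    """Return repositories for a data type, or general ones if none is listed."""
    return REPOSITORIES.get(data_type.strip().lower(), GENERAL_REPOSITORIES)


if __name__ == "__main__":
    for data_type in ["RNAseq", "SNP", "ionomics"]:
        print(data_type, "->", ", ".join(suggest_repository(data_type)))
```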

Re3data- searching repositories

https://www.re3data.org/

FAIRsharing- searching repositories

https://fairsharing.org/

FAIRsharing- searching metadata standards

https://fairsharing.org/

What if there is no specialized database?

Or no recommendations from journals?

You should get a Digital Object Identifier (DOI)

http://datadryad.org

** Curated, metadata

https://zenodo.org/

https://figshare.com/

https://datashare.ucsf.edu/stash

And institutional repositories
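Once a repository issues a DOI, cite it in a form that both people and software can resolve. The Python sketch below pulls the DOI out of the loose spellings that appear in papers and slides and rewrites it as a doi.org link. The regular expression is a common heuristic and an assumption here, not an official DOI grammar, and the function name is hypothetical.

```python
import re

# Heuristic DOI matcher (an assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_url(text: str) -> str:
    """Extract a DOI from free text and return it as a resolvable URL."""
    match = DOI_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no DOI found in {text!r}")
    # Drop punctuation that often trails a DOI at the end of a sentence.
    doi = match.group(0).rstrip(".,;)")
    return "https://doi.org/" + doi


if __name__ == "__main__":
    examples = [
        "doi.org/10.1105/tpc.17.00791",
        "DOI:/10.3389/fpls.2017.01812",
        "10.1186/s13059-016-1044-7.",
    ]
    for text in examples:
        print(text, "->", doi_url(text))
```

A string that raises `ValueError` here is worth checking by hand, since a mistyped DOI cannot be found or cited by anyone else either.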


But please, don’t forget to actually complete your submission*

*And you never have to spend time fielding requests or transferring huge data files again

Data Management Planning

What external sources of data will I be using?

Where will it come from?

Are there restrictions on reuse?

What types of data will I be generating?

What sort of metadata do I need to collect?

How will I structure and store my data?

Are there existing data handling standards and tools?

Where will the data reside when my project is done?

Is there a repository that can handle my data?

What metadata and files do I need to provide and how?

If I plan to host the data myself on a website/database, how will I maintain it, and for how long? What happens then?

Under what terms will others be able to reuse my data?

If I want to, how will I be able to track how my data is reused?

Cite, share freely and encourage others to be FAIR

Include searchable and citable identifiers for your data in your papers

Release your data with clearly defined terms of use

e.g. Creative Commons (CC) CC-0, CC-BY (https://creativecommons.org)

If you do not specify terms, restrictions may be implied, limiting reuse

Cite all of your data sources

Enhances reproducibility, and also shows value to funders!

When reviewing papers check them for FAIRness

Good data practices benefit everyone (and help you get funded)

NSF considers the Data Management Plan (DMP) to be an integral part of all full proposals (http://www.nsf.gov/bfa/dias/policy/dmp.jsp), that will be “considered under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance” (PAPPG, pg. II-212). BIO recognizes that different research communities may have their own data management practices and standards; that these norms will change over time; and, that lifecycles of usefulness will vary for different data types. As such, it is essential for scientific communities to guide needed standards development, and to shape expectations for sharing or archiving.

https://www.nsf.gov/bio/pubs/BIODMP102015.pdf

Thank you!

Please complete the survey

AgBioData
