database integration to improve accessibility to high-throughput sequence data
Post on 14-Jun-2015
Database Integration to Improve Accessibility to
High-Throughput Seq Data
TAZRO OHTA @inutano
What do you imagine when you hear the term
“Database”?
🙆
Knowledge Scientific data Experimental data
💡
🔎
Knowledge base Database Raw Data repository
💡
🔎
What kind of data?
Next-generation is already out there…
We all need
Raw data repo for
NGS
We’ve already seen
WHY WE NEED
Reproducibility is what makes science fair.
Two things are required of a data repository…
1: Reliability needs curation. Data should be archived correctly, with explicit metadata.
2: Accessibility needs a good interface. Data should be accessible to anyone, without special tricks.
Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch
Good: Simple, Fast, and no bugs (!)
Challenge: Missing metadata causes “NOT FOUND” results
PROBLEM:
???
DRASearch can NOT find
data without metadata …but that data definitely exists in the repo.
Too many entries to ask the submitters about;
so we implemented 🔨 a system to
make the metadata rich enough
Two sources integrated into DRA (DDBJ Read Archive) 📦
Publications can contain details of the sequencing process;
sequence read quality can serve as an indicator of data quality.
[Diagram: DDBJ Read Archive 📦 combined with PubMed / PMC publications and extracted read quality]
And then: integration makes it possible to implement
an efficient data search
Available via DBCLS SRA: http://sra.dbcls.jp/
Power of Integration: Metadata Search http://sra.dbcls.jp/search
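To make the idea concrete, here is a minimal sketch of why integration helps a text search. It is not the DBCLS SRA implementation; the `Entry` fields, the `search` function, and the `EXAMPLE_` accessions and their texts are all hypothetical placeholders made up for illustration. It only assumes that each archive entry may carry submitter metadata, linked publication text, and an extracted mean read quality.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    accession: str                        # placeholder ID, not a real DRA accession
    archive_metadata: str                 # free text from the submitter (may be empty)
    publication_text: str = ""            # text linked from PubMed / PMC, if any
    mean_quality: Optional[float] = None  # extracted read quality, if computed


def search(entries: List[Entry], keyword: str,
           min_quality: Optional[float] = None,
           use_publications: bool = True) -> List[str]:
    """Return accessions whose text contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        text = entry.archive_metadata
        if use_publications:
            # Integration step: publication text becomes searchable too.
            text += " " + entry.publication_text
        if keyword not in text.lower():
            continue
        if min_quality is not None:
            # Entries without a quality value cannot satisfy a quality filter.
            if entry.mean_quality is None or entry.mean_quality < min_quality:
                continue
        hits.append(entry.accession)
    return hits


# Made-up toy records: the first one has no submitter metadata at all.
entries = [
    Entry("EXAMPLE_0001", "", "ChIP-seq of histone marks in mouse liver", 34.2),
    Entry("EXAMPLE_0002", "RNA-seq, human", "", 28.0),
]

archive_only = search(entries, "chip-seq", use_publications=False)
integrated = search(entries, "chip-seq")
high_quality_rna = search(entries, "rna-seq", min_quality=30)
```

With archive metadata alone, the first record is "NOT FOUND"; once publication text is included, the same query finds it, and the quality value allows an extra filter on top.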
83% of seq reads have
an average quality over 30
0.03% of seq reads have over 50% N content
💀
👍
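The two figures above can be illustrated with a short sketch that computes, per read, the mean Phred quality and the fraction of N bases, then counts reads against the thresholds from the slides (quality over 30, N content over 50%). This is an assumption-laden illustration, not the tool used for DBCLS SRA: it assumes 4-line FASTQ records and Phred+33 quality encoding, and the function names are hypothetical.

```python
from io import StringIO
from typing import Dict, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:          # stop at end of file (or at a blank line)
            break
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header[1:], seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def n_fraction(seq: str) -> float:
    """Fraction of bases in the read that are called as N."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def summarize(handle: TextIO, min_quality: float = 30.0,
              max_n: float = 0.5) -> Dict[str, int]:
    """Count reads above the quality threshold and reads above the N threshold."""
    total = good = n_rich = 0
    for _read_id, seq, qual in read_fastq(handle):
        total += 1
        if mean_phred(qual) > min_quality:
            good += 1
        if n_fraction(seq) > max_n:
            n_rich += 1
    return {"total": total, "quality_over_30": good, "n_over_50pct": n_rich}


# Two made-up reads: one clean high-quality read, one mostly-N low-quality read.
sample = "@r1\nACGT\n+\nIIII\n@r2\nNNNA\n+\n!!!!\n"
stats = summarize(StringIO(sample))
```

Dividing `quality_over_30` and `n_over_50pct` by `total` gives percentages of the kind quoted on the slides.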
1: Reliability from paper / data quality: more description brings more proof.
2: Accessibility from text search: searching the linked publications brings flexibility.
2.20% of submitted projects have at least one publication
📦 📰 4429 / 201558
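As a quick check of the arithmetic, the percentage follows directly from the two counts on the slide. The function name `publication_share` is just a label chosen for this sketch.

```python
def publication_share(with_publication: int, total_projects: int) -> float:
    """Percentage of projects that have at least one linked publication."""
    return with_publication / total_projects * 100


# Counts taken from the slide: 4429 projects with a publication out of 201558.
share = publication_share(4429, 201558)
share_text = f"{share:.2f}%"
```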
PROBLEM:
NIH Data Sharing Guideline: http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx
What is
the next step to carry on?
1: Beyond Raw Data: the archive is going to handle alignment data.
2: Analysis Reproducibility: a public repo for analysis pipelines is required.
Database is for Biologists
not for developers.
Thank you! t.ohta@dbcls.rois.ac.jp
http://speakerdeck.com/inutano