database integration to improve accessibility to high-throughput sequence data

Post on 14-Jun-2015

Category:

Science

DESCRIPTION

National Institute of Genetics Retreat 2014 Tech Seminar #6

TRANSCRIPT

Database Integration to Improve Accessibility to High-Throughput Seq Data

TAZRO OHTA @inutano

What do you imagine when you hear the term “Database”?

🙆

Knowledge / Scientific data / Experimental data

💡

🔎

Knowledge base / Database / Raw data repository

What kind of data?

Next-generation sequencing is already out there…

We all need a raw data repo for NGS

We’ve already seen why we need it

Reproducibility is what makes science fair.

2 things are required for a data repository…

1: Reliability: data should be archived correctly, with explicit metadata

2: Accessibility: anyone should be able to access the data, without special tricks

1: Reliability needs curation

2: Accessibility needs a good interface

Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch

Good: Simple, Fast, and no bugs (!)

Challenge: missing metadata causes “NOT FOUND” results

PROBLEM:

???

DRASearch can NOT find data without metadata

…but those data definitely exist in the repo.

Too many entries to ask the submitters about,

so we implemented 🔨 a system to enrich the metadata

2 sources integrated into DRA 📦

📦📦

DDBJ Read Archive

Publications can provide details of the sequencing process.

Sequence read quality can serve as an indicator of data quality.

📦

📦📦

DDBJ Read Archive

PubMed PMC

Extracted Read Quality

And then: integration enables efficient data search
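
To make the idea concrete, here is a minimal sketch of why integration helps search: an entry with empty submitter metadata becomes findable once text from a linked publication and a read quality summary are attached to it. Everything in it is an assumption for illustration (the `Record` fields, the accession strings, the `search` function, the sample data); it is not the actual DBCLS SRA implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical archive entry after integration (all fields illustrative)."""
    accession: str                        # made-up ID, not a real accession
    metadata: str                         # submitter-provided text, may be empty
    publication_text: str = ""            # text taken from a linked paper, if any
    mean_quality: Optional[float] = None  # summary from read quality extraction


def searchable_text(rec: Record) -> str:
    # Search over submitter metadata AND publication text together.
    return f"{rec.metadata} {rec.publication_text}".lower()


def search(records: List[Record], keyword: str,
           min_quality: Optional[float] = None) -> List[str]:
    """Return accessions whose combined text contains the keyword,
    optionally keeping only entries at or above a mean read quality."""
    keyword = keyword.lower()
    hits = []
    for rec in records:
        if keyword not in searchable_text(rec):
            continue
        if min_quality is not None and (
            rec.mean_quality is None or rec.mean_quality < min_quality
        ):
            continue
        hits.append(rec.accession)
    return hits


records = [
    # No submitter metadata, but the linked paper describes the experiment.
    Record("EX000001", "", "ChIP-seq of mouse liver", 34.2),
    Record("EX000002", "RNA-seq human", "", 28.0),
    # No metadata and no publication: still not findable by text.
    Record("EX000003", "", "", 36.1),
]

print(search(records, "ChIP-seq"))
print(search(records, "RNA-seq", min_quality=30))
```

The first query finds `EX000001` only because the publication text was merged in; a metadata-only search would have missed it.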

Available via DBCLS SRA: http://sra.dbcls.jp/

Power of Integration: Metadata Search http://sra.dbcls.jp/search

83% of seq reads have an average quality over 30

0.03% of seq reads have over 50% N content
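
The two figures above can be illustrated with a small sketch that computes a per-read mean quality and N content. It assumes Phred+33 quality encoding and assumes that "average quality" means the per-read mean of base qualities; the slides do not state either, and the function names and sample reads are made up for illustration.

```python
def mean_quality(quality_line, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def n_fraction(sequence):
    """Fraction of bases in the read that are the ambiguous base N."""
    return sequence.upper().count("N") / len(sequence)


def summarize(reads, min_mean_quality=30, max_n_fraction=0.5):
    """Percentage of reads with mean quality over the threshold, and
    percentage of reads whose N content is over the threshold."""
    total = len(reads)
    good = sum(1 for _, qual in reads if mean_quality(qual) > min_mean_quality)
    mostly_n = sum(1 for seq, _ in reads if n_fraction(seq) > max_n_fraction)
    return {
        "pct_quality_ok": 100 * good / total,
        "pct_mostly_n": 100 * mostly_n / total,
    }


# Made-up (sequence, quality) pairs; 'I' is Phred 40, '5' is 20, '#' is 2.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # mean 40, no N
    ("ACGTNNNN", "IIII####"),  # mean 21, exactly 50% N (not over 50%)
    ("NNNNNNGT", "########"),  # mean 2, 75% N
    ("ACGTACGA", "IIIIIII5"),  # mean 37.5, no N
]

print(summarize(reads))
```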

💀

👍

1: Reliability from papers and data quality: more description brings more evidence.

2: Accessibility from text search: searching the linked publications brings flexibility.

2.20% of submitted projects have at least one publication

📦 📰 4429 / 201558
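
As a quick check of the figure above, taking 4429 as the number of projects with at least one publication and 201558 as the number of submitted projects (variable names are illustrative):

```python
projects_with_publication = 4429
submitted_projects = 201558

# Share of submitted projects linked to at least one publication.
share = 100 * projects_with_publication / submitted_projects
print(f"{share:.2f}%")  # prints "2.20%"
```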

PROBLEM:

NIH Data Sharing Guideline: http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

What is the next step to take?

1: Beyond raw data: the archive is going to handle alignment data.

2: Analysis reproducibility: a public repo for analysis pipelines is required.

Databases are for biologists,

not for developers.

Thank you! t.ohta@dbcls.rois.ac.jp

http://speakerdeck.com/inutano
