Database Integration to Improve Accessibility to High-Throughput Seq Data

Upload: tazro-ohta

Post on 14-Jun-2015

DESCRIPTION

National Institute of Genetics Retreat 2014 Tech Seminar #6

TRANSCRIPT

Page 1: Database Integration to Improve Accessibility to High-Throughput Sequence Data

Database Integration to Improve Accessibility to High-Throughput Seq Data

TAZRO OHTA @inutano

What do you imagine when you hear the term “Database”?

🙆

Knowledge | Scientific data | Experimental data

💡

🔎

Knowledge base | Database | Raw data repository

💡

🔎

What kind of data?

Next-generation is already out there…

We all need a raw data repository for NGS.

We’ve already seen WHY WE NEED it.

Reproducibility is what makes science fair.

Two things are required of a data repository…

1: Reliability. Data should be archived correctly, with explicit metadata.

2: Accessibility. Data should be accessible to anyone, without any special tricks.

1: Reliability needs curation. Data should be archived correctly, with explicit metadata.

2: Accessibility needs a good interface. Data should be accessible to anyone, without any special tricks.

Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch

Good: simple, fast, and no bugs (!)

Challenge: missing metadata causes “NOT FOUND” results

PROBLEM:

???

DRASearch can NOT find data without metadata… but the data definitely exist in the repo.

There are too many entries to ask every submitter, so we implemented 🔨 a system to make the metadata rich enough.

Two sources feed into DRA, the DDBJ Sequence Read Archive 📦

Publications can contain details of the sequencing process; sequence read quality can be a source of data-quality information.

Sources integrated into DRA (DDBJ Sequence Read Archive) 📦: publications (PubMed, PMC) and extracted read quality.

And then: integration enables us to implement efficient data search.

Available via DBCLS SRA: http://sra.dbcls.jp/

Power of Integration: Metadata Search (http://sra.dbcls.jp/search)

👍 83% of sequence reads have an average quality over 30.

💀 0.03% of sequence reads have an N content over 50%.
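The two read-quality figures above can be illustrated with a minimal sketch. It assumes Phred+33 (Sanger) quality encoding and a plain arithmetic mean of per-base scores; the slides do not say how the archive actually computes these values. The function names and the four toy reads are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean per-base Phred score of one read (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def n_fraction(sequence: str) -> float:
    """Fraction of bases in the read that are called as N."""
    return sequence.upper().count("N") / len(sequence)


def summarize(reads, min_quality=30.0, max_n=0.5):
    """Return (share of reads with mean quality over min_quality,
    share of reads with N content over max_n)."""
    total = len(reads)
    high_quality = sum(1 for _, qual in reads if mean_phred(qual) > min_quality)
    n_rich = sum(1 for seq, _ in reads if n_fraction(seq) > max_n)
    return high_quality / total, n_rich / total


# Toy (sequence, quality) pairs, not real archive data.
reads = [
    ("ACGTACGT", "IIIIIIII"),  # 'I' is Q40, so the mean is 40
    ("ACGTNNNN", "IIII####"),  # '#' is Q2, mean 21; N content exactly 50%
    ("NNNNNNGT", "########"),  # mean 2; N content 75%
    ("ACGTACGA", "??????II"),  # '?' is Q30, mean 32.5
]

print(summarize(reads))  # (0.5, 0.25)
```

Reads 1 and 4 pass the quality threshold, and only read 3 exceeds 50% N content, since the comparison is strictly "over" the cutoff.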

1: Reliability from paper and data quality. More description brings more proof.

2: Accessibility from text search. Searching the linked publications brings flexibility.
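A toy sketch of why searching publication text helps: an entry with empty submitter metadata becomes findable once the text of its linked paper is searched too. The `Entry` class, the accessions, and the `search` function are hypothetical illustrations, not the DBCLS SRA implementation.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str              # made-up accession, not a real record
    metadata: str = ""          # submitter-provided description, may be empty
    publication_text: str = ""  # text of a linked paper, if any


def search(entries, keyword, use_publications):
    """Return accessions whose searchable text contains the keyword."""
    keyword = keyword.lower()
    hits = []
    for entry in entries:
        text = entry.metadata
        if use_publications:
            # Integration step: the linked paper becomes searchable too.
            text += " " + entry.publication_text
        if keyword in text.lower():
            hits.append(entry.accession)
    return hits


entries = [
    Entry("EX000001", metadata="RNA-seq of mouse liver"),
    Entry("EX000002", publication_text="We sequenced mouse liver transcriptomes."),
    Entry("EX000003"),  # no metadata and no publication
]

metadata_only = search(entries, "liver", use_publications=False)
integrated = search(entries, "liver", use_publications=True)
print(metadata_only)  # ['EX000001']
print(integrated)     # ['EX000001', 'EX000002']
```

The third entry stays invisible either way, which matches the remaining problem: entries with neither metadata nor a publication cannot be found by text search.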

PROBLEM: only 2.20% of submitted projects have at least one publication (📰 4429 / 📦 201558).
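The percentage follows directly from the two counts on the slide. The variable names below are only illustrative.

```python
projects_with_publication = 4429
submitted_projects = 201558

share = projects_with_publication / submitted_projects
without_publication = submitted_projects - projects_with_publication

print(f"{share:.2%}")        # 2.20%
print(without_publication)   # 197129
```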

NIH Data Sharing Guideline: http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

What is the next step to carry on?

1: Beyond Raw Data. The archive is going to handle alignment data.

2: Analysis Reproducibility. A public repository for analysis pipelines is required.

Databases are for biologists, not for developers.

Thank you!

http://speakerdeck.com/inutano