database integration to improve accessibility to high-throughput sequence data

Database Integration to Improve Accessibility to High-Throughput Seq Data

TAZRO OHTA @inutano

What do you imagine when you hear the term “Database”?

Knowledge | Scientific data | Experimental data

Knowledge base | Database | Raw data repository

What kind of data?

Next-generation is already out there…

We all need

Raw data repo for

We’ve already seen

WHY WE NEED

Reproducibility is what makes science fair.

Two things are required of a data repository:

1: Reliability. Data should be archived correctly, with explicit metadata.

2: Accessibility. Anyone should be able to access the data, without special tricks.

1: Reliability needs curation. Data should be archived correctly, with explicit metadata.

2: Accessibility needs a good interface. Anyone should be able to access the data, without special tricks.

Current web interface for DRA: http://trace.ddbj.nig.ac.jp/DRASearch

Good: Simple, Fast, and no bugs (!)

Challenge: Missing metadata causes “NOT FOUND” results

PROBLEM:

DRASearch cannot find data without metadata,

…but the data definitely exist in the repo.

There are too many entries to ask the submitters,

so we implemented 🔨 a system to make the metadata rich enough

Two sources integrated into DRA 📦

📦📦

DDBJ Read Archive

Publications can carry details of the sequencing process.

Sequence read quality can serve as an indicator of data quality.

📦📦

DDBJ Read Archive

PubMed PMC

Extracted Read Quality

And then: integration enables us to implement efficient data search

Available via DBCLS SRA: http://sra.dbcls.jp/

Power of Integration: Metadata Search: http://sra.dbcls.jp/search

83% of sequence reads have an average quality over 30

0.03% of sequence reads have an N content over 50%
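The slides do not say how these two figures were computed. As a rough sketch only, the per-read metrics could be derived from FASTQ records like this. The Phred+33 encoding, the function names, and the toy reads are all assumptions made for illustration, not the system described in the talk.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read's quality string (Phred+33 is assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def n_fraction(sequence: str) -> float:
    """Fraction of bases in one read that are called as N."""
    return sequence.upper().count("N") / len(sequence)


def summarize(reads, min_quality=30.0, max_n=0.5):
    """Return two shares over all reads:
    reads with mean quality above min_quality, and
    reads with an N fraction above max_n."""
    reads = list(reads)
    high_q = sum(mean_phred(q) > min_quality for _, q in reads)
    mostly_n = sum(n_fraction(s) > max_n for s, _ in reads)
    return high_q / len(reads), mostly_n / len(reads)


# Toy (sequence, quality) pairs, made up for this example.
toy_reads = [
    ("ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("ACGTNCGT", "55555555"),  # '5' encodes Phred 20
    ("NNNNNACG", "IIIIIIII"),  # 5 of 8 bases are N
    ("ACGTACGA", "IIII5555"),  # mean quality is exactly 30
]

share_high_quality, share_mostly_n = summarize(toy_reads)
```

With thresholds of 30 and 50%, the same two shares computed over a whole archive would correspond to the kind of figures quoted on the slide.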

1: Reliability from papers and data quality. More description brings more proof.

2: Accessibility from text search. A search that includes publications brings flexibility.
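To make the second point concrete, here is a minimal sketch of why searching over linked publication text finds entries that a metadata-only search misses. The `Entry` type, the placeholder IDs, and the sample texts are hypothetical and are not taken from DRA or DBCLS SRA.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str              # placeholder ID, not a real accession
    metadata: str               # submitter-provided description, may be empty
    publication_text: str = ""  # text of a linked paper, if one exists


def search(entries, query, use_publications=True):
    """Return accessions whose metadata contains the query,
    optionally also searching the linked publication text."""
    q = query.lower()
    hits = []
    for entry in entries:
        text = entry.metadata
        if use_publications:
            text += " " + entry.publication_text
        if q in text.lower():
            hits.append(entry.accession)
    return hits


entries = [
    Entry("ENTRY_1", "RNA-seq of mouse liver"),
    Entry("ENTRY_2", "", "We performed ChIP-seq on mouse liver samples"),
    Entry("ENTRY_3", "", ""),
]

metadata_only = search(entries, "liver", use_publications=False)
integrated = search(entries, "liver")
```

In this toy data, the entry with empty metadata but a linked paper is returned only by the integrated search. An entry with neither metadata nor a publication stays unfindable either way.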

2.20% of submitted projects have at least one publication

📦 📰4429 / 201558

PROBLEM:

NIH Data Sharing Guideline: http://www.niaid.nih.gov/LabsAndResources/resources/dmid/Pages/data.aspx

What is the next step to take?

1: Beyond Raw Data. The archive is going to handle alignment data.

2: Analysis Reproducibility. A public repo for analysis pipelines is required.

👯 1: Beyond Raw Data. The archive is going to handle alignment data.

2: Analysis Reproducibility. A public repo for analysis pipelines is required.

Databases are for biologists,

not for developers.

Thank you! t.ohta@dbcls.rois.ac.jp

http://speakerdeck.com/inutano
