Biohackathon2013: Tripling Bioinformatics Productivity

Tripling Bioinformatics Productivity. Jerven Bolleman, Developer, UniProtKB/Swiss-Prot

Upload: jervenbolleman

Post on 15-Jan-2015

Category:

Education


DESCRIPTION

A talk about RDF/SPARQL and what it means for bioinformatics. The main point is that SPARQL is a universal API to data.

TRANSCRIPT

Page 1: Biohackathon2013: Tripling Bioinformatics Productivity

Tripling Bioinformatics Productivity

Jerven Bolleman

Developer

UniProtKB/Swiss-Prot

Page 2: Biohackathon2013: Tripling Bioinformatics Productivity

© 2013 SIB

Thank you

Page 3: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

UniProt.rdf SPARQL

Page 4: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

UniProt.rdf SPARQL

Page 5: Biohackathon2013: Tripling Bioinformatics Productivity

Page 6: Biohackathon2013: Tripling Bioinformatics Productivity

Data first

• Biocuration
  – Recover information ‘lost’ in papers

• curation ≠ data entry
  – Extract knowledge from data

• Structuring knowledge
  – to integrate with related data
  – to answer further questions

Page 7: Biohackathon2013: Tripling Bioinformatics Productivity

Biocuration

Page 8: Biohackathon2013: Tripling Bioinformatics Productivity

• And the rock gets larger every day

Biocuration

Page 9: Biohackathon2013: Tripling Bioinformatics Productivity

MADNESS!

THIS IS Swiss-Prot!

Page 10: Biohackathon2013: Tripling Bioinformatics Productivity

63% more triples in a year

Page 11: Biohackathon2013: Tripling Bioinformatics Productivity

Make data retrieval worthwhile

• If your data is not easily accessible, then no one will query it.

• Simple would be nice, but:
  – you cannot make it simpler than your data
  – if the biology is difficult, so is your database

• After retrieval you must:
  – visualize
  – summarize

Page 12: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL? Give me a better pipette

Page 13: Biohackathon2013: Tripling Bioinformatics Productivity

Visualization is work

Page 14: Biohackathon2013: Tripling Bioinformatics Productivity

Visualization is work

Page 15: Biohackathon2013: Tripling Bioinformatics Productivity

Page 16: Biohackathon2013: Tripling Bioinformatics Productivity

www.ebi.ac.uk/fgpt/gwas/

Page 17: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

SPARQL

CSV

SERVICE

UniProt.rdf SPARQL

Page 18: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL or CLAY

Page 19: Biohackathon2013: Tripling Bioinformatics Productivity

Progression of query languages

Standardized:

• SQL: 1986-2011
• XPath/XQuery: 1999-2008
• SPARQL: 2008-2013

SPARQL

Page 20: Biohackathon2013: Tripling Bioinformatics Productivity

SQL is not standardized

• 7th ISO standard version

• Yet...
  – SHOW TABLES
  – SELECT table_name FROM user_tables
  – LIST TABLES

• Schemas are not fully transferable
  – VARCHAR2 or VARCHAR or CHAR or TEXT...
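
The three statements above are the MySQL-style, Oracle-style and DB2-style spellings of one question: which tables exist? As a small runnable illustration of the same point, SQLite needs yet another spelling. The sketch below uses only Python's standard `sqlite3` module; the `proteins` table and the helper names are made up for the example.

```python
import sqlite3


def list_tables(conn):
    """Return the user table names of an SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def dialect_error(conn, statement):
    """Run a statement and return the error text if SQLite rejects it."""
    try:
        conn.execute(statement)
    except sqlite3.OperationalError as exc:
        return str(exc)
    return None


conn = sqlite3.connect(":memory:")
# Hypothetical table, only here so that there is something to list.
conn.execute("CREATE TABLE proteins (accession TEXT, name VARCHAR(50))")

tables = list_tables(conn)
# The MySQL-style spelling is a syntax error in SQLite.
show_tables_error = dialect_error(conn, "SHOW TABLES")
```

Here `tables` holds the one table that was created, while `show_tables_error` holds the syntax error SQLite raises for the MySQL spelling.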

SPARQL

Page 21: Biohackathon2013: Tripling Bioinformatics Productivity

XPath/XQuery

• Fully standardized
  – Also in the marketplace

• Tree-based document query model
  – Assumes all data is in one document

SPARQL

Page 22: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL

• Fully standardized
  – Also in the marketplace

• Graph-based document query model
  – Assumes all data is reachable via the internet
  – Assumes nothing about the storage model

SPARQL

Page 23: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against

• RDBMS
  – R2RML -> D2RQ, Ultrawrap, XSPARQL...

• Programs
  – SADI...

• Triplestore
  – MarkLogic, OWLIM, uRiKA, Oracle Spatial or NoSQL...

• Key-value
  – Redis

• Bioinformatics flat file formats
  – sparql-bed

• CSV/TSV/Spreadsheets
  – Tarql, Sparqlify

SPARQL

Page 24: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

SPARQL

UniProt.rdf SPARQL

CSV

SERVICE

Page 25: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

bed file

chr7 127471196 127472363 Pos1 0 + 127471196 127472363

chr7 127472363 127473530 Pos2 0 + 127472363 127473530

[Diagram: the two bed lines drawn as graph nodes: chr7; 127471196, 127472363, pos1; 127472363, 127473530, pos2; 0 +]

SPARQL

Page 26: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)

• CSV is a relation between fields via headers

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363

chr7 127472363 127473530 Pos2 0 + 127472363 127473530

Start End

Page 27: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)

• CSV is a relation between fields via headers

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363

chr7 127472363 127473530 Pos2 0 + 127472363 127473530

faldo:start  faldo:end  a faldo:ExactPosition

Page 28: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)

• CSV is a relation between fields via headers

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363

chr7 127472363 127473530 Pos2 0 + 127472363 127473530

?start ?end
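
The idea on these slides, that each line of a delimited file becomes a subject and each column header becomes a predicate, can be sketched in a few lines of Python. This is an illustrative assumption, not the sparql-bed implementation. The column-to-predicate mapping is hypothetical, prefixed names are kept as plain strings, and BED versus FALDO coordinate conventions are ignored.

```python
# Column names of the BED format; they play the role of the CSV "headers".
BED_COLUMNS = ["chrom", "chromStart", "chromEnd", "name",
               "score", "strand", "thickStart", "thickEnd"]

# Hypothetical mapping from BED columns to predicates.
COLUMN_TO_PREDICATE = {
    "chrom": "faldo:reference",
    "chromStart": "faldo:begin",
    "chromEnd": "faldo:end",
    "name": "rdfs:label",
}


def bed_line_to_triples(line, line_number):
    """Turn one BED line into (subject, predicate, object) triples."""
    fields = dict(zip(BED_COLUMNS, line.split()))
    subject = f"_:line{line_number}"  # one blank node per line of the file
    triples = [(subject, "rdf:type", "faldo:Region")]
    for column, predicate in COLUMN_TO_PREDICATE.items():
        value = fields[column]
        if column in ("chromStart", "chromEnd"):
            value = int(value)  # positions are numbers, not strings
        triples.append((subject, predicate, value))
    return triples


bed = """chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530"""

triples = [
    triple
    for number, line in enumerate(bed.splitlines(), start=1)
    for triple in bed_line_to_triples(line, number)
]
```

Each line yields one type triple plus one triple per mapped column, so a query variable such as `?start` simply binds to the object of the `faldo:begin` triple for that line.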

Page 29: Biohackathon2013: Tripling Bioinformatics Productivity

Sesame

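// Minimal triple source: every pattern lookup returns no statements.
@Override
public CloseableIteration<? extends Statement, QueryEvaluationException> getStatements(
        Resource subj, URI pred, Value obj, Resource... namedgraph)
        throws QueryEvaluationException {
    return new EmptyIteration<Statement, QueryEvaluationException>();
}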
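
The method above is the one hook the query engine calls: given a pattern in which subject, predicate or object may be left unbound, return the matching statements. A rough Python analogue over BED text is sketched below. It is not the Sesame API: the class name, the predicate strings and the use of `None` as a wildcard are all assumptions made for illustration.

```python
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, object]


class BedTripleSource:
    """Hypothetical triple source that answers patterns from BED text."""

    def __init__(self, bed_text: str):
        self._rows = [line.split() for line in bed_text.splitlines() if line.strip()]

    def _all_statements(self) -> Iterator[Triple]:
        # Generate the statements lazily from the file contents.
        for number, fields in enumerate(self._rows, start=1):
            subject = f"_:line{number}"
            yield (subject, "faldo:begin", int(fields[1]))
            yield (subject, "faldo:end", int(fields[2]))
            yield (subject, "rdfs:label", fields[3])

    def get_statements(self, subj: Optional[str] = None,
                       pred: Optional[str] = None,
                       obj: Optional[object] = None) -> Iterator[Triple]:
        """Yield statements matching the pattern; None matches anything."""
        for s, p, o in self._all_statements():
            if subj is not None and s != subj:
                continue
            if pred is not None and p != pred:
                continue
            if obj is not None and o != obj:
                continue
            yield (s, p, o)


bed = """chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530"""

source = BedTripleSource(bed)
ends = list(source.get_statements(pred="faldo:end"))
starts_at = list(source.get_statements(pred="faldo:begin", obj=127472363))
```

The storage stays a flat file. Only the pattern lookup has to be written, which is the sense in which SPARQL assumes nothing about the storage model.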

Page 30: Biohackathon2013: Tripling Bioinformatics Productivity

Big O compared to other approaches

• If the SPARQL engine:
  – detects query is per CSV “line”
    • O(number of lines)
  – else
    • O(number of lines * number of joins)

• Same as
  – cat | perl -ne
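
To make the two cases concrete, the sketch below counts how many lines are read for a per-line query and for a query with one join. The hash-join strategy, with one extra pass over the file per join, is an assumption made for illustration and does not describe any particular engine. All function names are hypothetical.

```python
def scan(lines, counter):
    """Yield parsed BED rows, counting every line that is read."""
    for line in lines:
        counter["reads"] += 1
        chrom, begin, end, name = line.split()[:4]
        yield chrom, int(begin), int(end), name


def longer_than(lines, min_length, counter):
    """Per-line query: each row is answered on its own, in a single pass."""
    return [name for _, begin, end, name in scan(lines, counter)
            if end - begin > min_length]


def adjacent_pairs(lines, counter):
    """One join: regions whose end equals another region's begin.

    One pass builds a hash index on begin, a second pass probes it.
    """
    by_begin = {}
    for chrom, begin, _, name in scan(lines, counter):
        by_begin.setdefault((chrom, begin), []).append(name)
    return [(name, other)
            for chrom, _, end, name in scan(lines, counter)
            for other in by_begin.get((chrom, end), [])]


bed = [
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363",
    "chr7 127472363 127473530 Pos2 0 + 127472363 127473530",
]

per_line_counter = {"reads": 0}
long_regions = longer_than(bed, 1000, per_line_counter)

join_counter = {"reads": 0}
pairs = adjacent_pairs(bed, join_counter)
```

With two lines, the per-line query reads two lines and the join reads four. The cost grows with the number of lines times the number of passes, which matches what a `cat | perl -ne` pipeline with one pass per lookup would pay.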

Page 31: Biohackathon2013: Tripling Bioinformatics Productivity

• Strengths
  – Isolates data format from querying
  – Easy to put data on the web
    • (public SPARQL endpoints)
  – Single point of optimization
    • e.g. parallel query execution
  – Other programs can still access data

• Weaknesses
  – Time to code SPARQL to CSV translation
  – Latency
  – Harder to hack the code to see what is going on
    • (no pipe > to temporary file)

Page 32: Biohackathon2013: Tripling Bioinformatics Productivity

Doing this in Perl

wget ftp://ftp.ncbi...human_9606/VCF/00-All.vcf.gz

tabix 00-All.vcf.gz -B target_locations.bed |
  perl -ane 'BEGIN{%patient=split /(\S+\n)/s, `cat target_locations.bed`}
    $alt_bases = $patient{"$F[0]\t$F[1]\t".($F[1]+length($F[3])-1)."\t"};
    chomp $alt_bases;
    print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'

Page 33: Biohackathon2013: Tripling Bioinformatics Productivity

Doing this in SPARQL

SELECT ?patientSnp ?dbSnp ?maf {
  ?patientSnp a ?mutationType ;
              faldo:begin ?patientBegin ;
              faldo:end   ?patientEnd ;
              rdf:value   ?patientValue .
  ?mutationType rdfs:subClassOf :mutation .
  SERVICE <ftp://ftp.ncbi.../human_9606/VCF/00-All.vcf.gz> {
    ?dbSnp a ?mutationType ;
           faldo:begin ?patientBegin ;
           faldo:end   ?patientEnd ;
           rdf:value   ?patientValue ;
           :MinorAlleleFrequency ?maf .
  }
}
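
Both versions express the same join: match each patient variant to a dbSNP record with the same position and alternative allele, then report the minor allele frequency. For readers who know neither Perl nor SPARQL, the sketch below spells that join out in Python. The input layout is inferred from the Perl one-liner (a four-column tab-separated patient file and VCF-style records). The data, the `rs` identifiers and the function names are invented for the example.

```python
import re

MAF_PATTERN = re.compile(r"MAF=(\d\.\d+)")


def load_patient_variants(bed_text):
    """Map (chrom, start, end) to the alternative bases of a patient variant."""
    variants = {}
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, alt = line.split("\t")
        variants[(chrom, int(start), int(end))] = alt
    return variants


def match_dbsnp(vcf_text, patient):
    """Yield (chrom, pos, id, ref, alt, maf) for records matching a patient variant."""
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, snp_id, ref, alt = line.split("\t")[:5]
        # Same key as the Perl version: the end is derived from the reference length.
        key = (chrom, int(pos), int(pos) + len(ref) - 1)
        maf = MAF_PATTERN.search(line)
        if maf and patient.get(key) == alt:
            yield (chrom, int(pos), snp_id, ref, alt, float(maf.group(1)))


# Invented example data, not real dbSNP records.
patient_bed = "7\t127471196\t127471196\tT\n"
dbsnp_vcf = (
    "##fileformat=VCFv4.1\n"
    "7\t127471196\trs0000001\tC\tT\t.\t.\tMAF=0.1234\n"
    "7\t127471300\trs0000002\tG\tA\t.\t.\tMAF=0.0500\n"
)

matches = list(match_dbsnp(dbsnp_vcf, load_patient_variants(patient_bed)))
```

The dictionary lookup plays the role that the shared `?patientBegin`, `?patientEnd` and `?patientValue` variables play in the SPARQL query.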

Page 34: Biohackathon2013: Tripling Bioinformatics Productivity

At your SERVICE

Page 35: Biohackathon2013: Tripling Bioinformatics Productivity

SELECT ?doi ?citatingDoi
WHERE {
  uniprot:P06280 up:annotation ?annotation ;
                 up:citation   ?citation .
  ?citation dc:identifier ?doiRaw ;
            up:name "Nature" .
  ?annotation a up:Disease_Annotation .
  BIND (substr(?doiRaw, 5) as ?doi)
  SERVICE <http://data.nature.com/sparql> {
    ?article prism:doi ?doi ;
             nature:hasCitation ?citationCitingCitation .
    ?citationCitingCitation prism:doi ?citatingDoi
  }
}
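
The `BIND (substr(?doiRaw, 5) as ?doi)` step is what lets the two endpoints join: SPARQL's `SUBSTR` counts characters from 1, so starting at position 5 drops a four-character prefix. The sketch below mimics that behaviour in Python for integer arguments. The assumption that the raw identifier carries a `doi:` prefix is inferred from the query, and the DOI value is a made-up placeholder.

```python
def sparql_substr(source, starting_loc, length=None):
    """Mimic SPARQL SUBSTR for integer arguments; positions are 1-based."""
    start_index = max(starting_loc, 1) - 1
    if length is None:
        return source[start_index:]
    # Characters at 1-based positions p with starting_loc <= p < starting_loc + length.
    end_index = max(starting_loc + length - 1, 0)
    return source[start_index:end_index]


# Placeholder identifier, assuming the raw value is stored as "doi:<DOI>".
doi_raw = "doi:10.1038/nature00000"
doi = sparql_substr(doi_raw, 5)
```

After the prefix is stripped, `?doi` has the same form on both sides of the `SERVICE` call, assuming the remote endpoint stores `prism:doi` as a bare DOI.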

Page 36: Biohackathon2013: Tripling Bioinformatics Productivity

Benefits of SERVICE

• In a world where data keeps growing
  – upload a 1KB query = cheap
  – download a 500GB dataset = expensive

• SPARQL viable via the web
  – 400GB of UniProt data can stay at UniProt
  – Your NGS data can stay in your data centre

• Easiest data compression is avoiding 100 copies ;)
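
A back-of-the-envelope calculation shows how lopsided the two transfers are. The sizes come from the slide. The 100 Mbit/s link speed is an assumption chosen only for illustration, and protocol overhead and query execution time are ignored.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Time to move size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / bandwidth_bits_per_second


BANDWIDTH = 100e6  # assumed link speed: 100 Mbit/s

query_seconds = transfer_seconds(1_000, BANDWIDTH)    # 1 KB query
dataset_seconds = transfer_seconds(500e9, BANDWIDTH)  # 500 GB dataset
dataset_hours = dataset_seconds / 3600
```

Under that assumption the query uploads in well under a millisecond, while the dataset takes 40,000 seconds, roughly 11 hours, and that cost is paid again by every group that makes its own copy.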

Page 37: Biohackathon2013: Tripling Bioinformatics Productivity

Network of SPARQL endpoints

• Like a social network
  – value increases the more members there are

Page 38: Biohackathon2013: Tripling Bioinformatics Productivity

Network of SPARQL endpoints

• Like a social network
  – value increases the more members there are

Page 39: Biohackathon2013: Tripling Bioinformatics Productivity

42