biohackathon2013: tripling bioinformatics productivity
DESCRIPTION
Talking about RDF/SPARQL and what it means for bioinformatics. The main point is that SPARQL is a universal API to data.
TRANSCRIPT
Tripling Bioinformatics Productivity
Jerven Bolleman
Developer
UniProtKB/Swiss-Prot
© 2013 SIB
Thank you
UniProt.rdf
UniProt.rdf SPARQL
Data first
• Biocuration
  – Recover information ‘lost’ in papers
• curation ≠ data entry
  – Extract knowledge from data
• Structuring knowledge
  – to integrate with related data
  – to answer further questions
Biocuration
• And the rock gets larger every day
MADNESS! THIS IS Swiss-Prot!
63% more triples in a year
Make data retrieval worthwhile
• If your data is not easily accessible, then no one will query it.
• Simple would be nice, but:
  – you cannot make it simpler than your data
  – if the biology is difficult, so is your database
• After retrieval you must:
  – visualize
  – summarize
SPARQL? Give me a better pipette
Visualization is work
[Diagram labels: UniProt.rdf, SPARQL, CSV, SERVICE]
SPARQL or CLAY
Progression of query languages
Language         Standardized
SQL              1986-2011
XPath, XQuery    1999-2008
SPARQL           2008-2013
SQL is not standardized
• 7th ISO standard version
• Yet...
  – SHOW TABLES
  – SELECT table_name FROM user_tables
  – LIST TABLES
• Schemas are not fully transferable
  – VARCHAR2 or VARCHAR or CHAR or TEXT...
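To make the portability point concrete, here is a minimal Python sketch using the standard-library sqlite3 module. SQLite accepts none of the three catalogue commands listed above and needs a fourth one. The dialect labels (MySQL, Oracle, DB2), the `CATALOG_QUERIES` mapping, and the `protein` table are assumptions added for illustration and are not on the slide.

```python
import sqlite3

# Catalogue queries by dialect. The first three statements come from the
# slide; the dialect labels and the SQLite entry are illustrative assumptions.
CATALOG_QUERIES = {
    "mysql": "SHOW TABLES",
    "oracle": "SELECT table_name FROM user_tables",
    "db2": "LIST TABLES",
    "sqlite": "SELECT name FROM sqlite_master WHERE type = 'table'",
}


def working_dialects(conn):
    """Return the dialect names whose catalogue query runs on this connection."""
    accepted = []
    for dialect, query in CATALOG_QUERIES.items():
        try:
            conn.execute(query).fetchall()
        except sqlite3.OperationalError:
            continue  # unknown syntax or unknown catalogue table
        accepted.append(dialect)
    return accepted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT, name TEXT)")  # hypothetical table
accepted = working_dialects(conn)
tables = conn.execute(CATALOG_QUERIES["sqlite"]).fetchall()
print(accepted, tables)
```

Only the SQLite-specific query should survive here, which is the slide's complaint in miniature: the same question needs a different statement per engine.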
XPath/XQuery
• Fully standardized
  – Also in the marketplace
• Tree-based document query model
  – Assumes all data is in one document
SPARQL
• Fully standardized
  – Also in the marketplace
• Graph-based document query model
  – Assumes all data is reachable via the internet
  – Assumes nothing about the storage model
SPARQL against
• RDBMS
  – R2RML -> D2RQ, Ultrawrap, XSPARQL...
• Programs
  – SADI...
• Triplestore
  – MarkLogic, OWLIM, uRiKA, Oracle Spatial or NoSQL...
• Key-value
  – Redis
• Bioinformatics flat file formats
  – sparql-bed
• CSV/TSV/Spreadsheets
  – Tarql, Sparqlify
SPARQL against CSV
bed file
chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530
[Figure: the two rows drawn as a graph with nodes chr7, 127471196, 127472363, pos1, 0 +, 127473530, pos2]
SPARQL against CSV
• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)
• CSV is a relation between fields via headers
chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530
Start End
faldo:start faldo:end a faldo:ExactPosition
?start ?end
Sesame
@Override
public CloseableIteration getStatements(Resource subj, URI pred, Value obj,
        Resource... namedgraph) throws QueryEvaluationException {
    // Stub: no statements match, whatever the pattern
    return new EmptyIteration();
}
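The stub above returns an empty iteration. As a rough idea of what a BED-backed version of the same hook could return, here is a small Python sketch (Python rather than Java only to keep it short and self-contained). It is not the sparql-bed implementation: the `bed:` identifier scheme, the `bed:chromosome` predicate, and the function names are hypothetical, and the `faldo:start` / `faldo:end` labels are the ones used on the slides.

```python
from typing import Iterator, NamedTuple, Optional

# The two BED rows from the slides, tab-separated.
BED_TEXT = (
    "chr7\t127471196\t127472363\tPos1\t0\t+\t127471196\t127472363\n"
    "chr7\t127472363\t127473530\tPos2\t0\t+\t127472363\t127473530\n"
)


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str


def bed_statements(text: str) -> Iterator[Statement]:
    """Turn each BED line into triples: one subject per line, one predicate per column."""
    for line in text.splitlines():
        chrom, start, end, name = line.split("\t")[:4]
        subject = "bed:" + name  # hypothetical identifier scheme
        yield Statement(subject, "bed:chromosome", chrom)  # hypothetical predicate
        yield Statement(subject, "faldo:start", start)     # label used on the slide
        yield Statement(subject, "faldo:end", end)


def get_statements(text: str, subj: Optional[str] = None,
                   pred: Optional[str] = None,
                   obj: Optional[str] = None) -> Iterator[Statement]:
    """Yield the statements matching a triple pattern; None acts as a wildcard."""
    for st in bed_statements(text):
        if subj is not None and st.subject != subj:
            continue
        if pred is not None and st.predicate != pred:
            continue
        if obj is not None and st.obj != obj:
            continue
        yield st


starts = [st.obj for st in get_statements(BED_TEXT, pred="faldo:start")]
print(starts)  # ['127471196', '127472363']
```

A query engine only ever asks this one question (which statements match this pattern?), so the file format stays hidden behind it.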
Big O compared to other approaches
• If the SPARQL engine:
  – detects query is per CSV “line”
    • O(number of lines)
  – else
    • O(number of lines * number of joins)
• Same as
  – cat | perl -ne
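A rough way to see the two cost cases, sketched in Python. The sketch reads the "else" case as one extra pass over the file per join, which is an interpretation of the slide rather than something it states. `CountingLines` and both query functions are hypothetical helpers that only count how often a line is read.

```python
class CountingLines:
    """Wrap a list of lines and count every line read during iteration."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.reads = 0

    def __iter__(self):
        for line in self._lines:
            self.reads += 1
            yield line


BED_LINES = [
    "chr7\t127471196\t127472363\tPos1",
    "chr7\t127472363\t127473530\tPos2",
]


def features_on(source, chrom):
    """Per-line query: every condition is checked on one line, so one pass."""
    names = []
    for line in source:
        fields = line.split("\t")
        if fields[0] == chrom:
            names.append(fields[3])
    return names


def adjacent_features(source):
    """One join (end of a feature = start of another): two passes over the file."""
    by_start = {}
    for line in source:  # pass 1: index features by (chromosome, start)
        chrom, start, _end, name = line.split("\t")[:4]
        by_start[(chrom, start)] = name
    pairs = []
    for line in source:  # pass 2: probe the index with each feature's end
        chrom, _start, end, name = line.split("\t")[:4]
        other = by_start.get((chrom, end))
        if other is not None:
            pairs.append((name, other))
    return pairs


scan = CountingLines(BED_LINES)
print(features_on(scan, "chr7"), scan.reads)  # ['Pos1', 'Pos2'] 2

join = CountingLines(BED_LINES)
print(adjacent_features(join), join.reads)  # [('Pos1', 'Pos2')] 4
```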
• Strengths
  – Isolates data format from querying
  – Easy to put data on the web
    • (public SPARQL endpoints)
  – Single point of optimization
    • e.g. parallel query execution
  – Other programs can still access data
• Weaknesses
  – Time to code SPARQL to CSV translation
  – Latency
  – Harder to hack the code to see what is going on
    • (no pipe > to temporary file)
Doing this in Perl
wget ftp://ftp.ncbi...human_9606/VCF/00-All.vcf.gz
tabix 00-All.vcf.gz -B target_locations.bed | perl -ane '
  BEGIN { %patient = split /(\S+\n)/s, `cat target_locations.bed` }
  $alt_bases = $patient{"$F[0]\t$F[1]\t" . ($F[1] + length($F[3]) - 1) . "\t"};
  chomp $alt_bases;
  print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'
Doing this in SPARQL
SELECT ?patientSnp ?dbSnp ?maf {
  ?patientSnp a ?mutationType ;
      faldo:begin ?patientBegin ;
      faldo:end ?patientEnd ;
      rdf:value ?patientValue .
  ?mutationType rdfs:subClassOf :mutation .
  SERVICE <ftp://ftp.ncbi.../human_9606/VCF/00-All.vcf.gz> {
    ?dbSnp a ?mutationType ;
        faldo:begin ?patientBegin ;
        faldo:end ?patientEnd ;
        rdf:value ?patientValue ;
        :MinorAlleleFrequency ?maf .
  }
}
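For readers less used to graph patterns, the join this query expresses can be sketched in plain Python: a patient SNP and a database SNP match when they share mutation type, begin, end and value, and the database side supplies the minor allele frequency. The `Snp` record, the `match_snps` function, and every value in the toy data are invented for illustration. The `rdfs:subClassOf` check is left out.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Snp(NamedTuple):
    snp_id: str
    mutation_type: str
    begin: int
    end: int
    value: str
    maf: Optional[float] = None  # only known for database SNPs


def match_snps(patient: Iterable[Snp],
               db_snps: Iterable[Snp]) -> Iterator[Tuple[str, str, float]]:
    """Yield (patient SNP, database SNP, MAF) for SNPs agreeing on all shared fields."""
    index = defaultdict(list)
    for snp in db_snps:
        if snp.maf is not None:  # the pattern requires a MinorAlleleFrequency
            index[(snp.mutation_type, snp.begin, snp.end, snp.value)].append(snp)
    for snp in patient:
        key = (snp.mutation_type, snp.begin, snp.end, snp.value)
        for hit in index.get(key, []):
            yield snp.snp_id, hit.snp_id, hit.maf


# Toy data, invented for this sketch.
patient = [
    Snp("patient:1", "substitution", 1000, 1000, "A"),
    Snp("patient:2", "substitution", 2000, 2000, "G"),
]
db_snps = [
    Snp("db:a", "substitution", 1000, 1000, "A", 0.25),
    Snp("db:b", "substitution", 2000, 2000, "T", 0.1),
]
matches = list(match_snps(patient, db_snps))
print(matches)  # [('patient:1', 'db:a', 0.25)]
```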
At your SERVICE
SELECT ?doi ?citatingDoi
WHERE {
  uniprot:P06280 up:annotation ?annotation ;
      up:citation ?citation .
  ?citation dc:identifier ?doiRaw ;
      up:name "Nature" .
  ?annotation a up:Disease_Annotation .
  BIND (substr(?doiRaw, 5) as ?doi)
  SERVICE <http://data.nature.com/sparql> {
    ?article prism:doi ?doi ;
        nature:hasCitation ?citationCitingCitation .
    ?citationCitingCitation prism:doi ?citatingDoi
  }
}
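Two mechanics behind this query, sketched in Python with the standard library only. First, a SPARQL query travels to an endpoint as an ordinary HTTP request (the SPARQL 1.1 Protocol allows a GET with a `query` parameter). Second, SPARQL's `substr` counts from 1, so `substr(?doiRaw, 5)` drops the first four characters. The endpoint URL, the sample query, and the sample identifier are placeholders, and the guess that `?doiRaw` starts with a four-character prefix such as `doi:` is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def sparql_substr(text: str, start: int) -> str:
    """Mimic two-argument SPARQL substr for start >= 1 (positions are 1-based)."""
    return text[start - 1:]


# Placeholder endpoint and query, for illustration only.
request = build_sparql_request("https://example.org/sparql",
                               "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.get_method(), request.full_url)

# Assumed shape of ?doiRaw; the DOI itself is made up.
print(sparql_substr("doi:10.1000/example", 5))  # 10.1000/example
```

Sending the request would be `urllib.request.urlopen(request)`; it is left out so the sketch runs without network access.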
Benefits of SERVICE
• In a world where data keeps growing
  – upload a 1KB query = cheap
  – download a 500GB dataset = expensive
• SPARQL viable via the web
  – 400GB of UniProt data can stay at UniProt
  – Your NGS data can stay in your data centre
• Easiest data compression is avoiding 100 copies ;)
Network of SPARQL endpoints
• Like a social network
  – value increases the more members there are