biohackathon2013: tripling bioinformatics productivity

Tripling Bioinformatics Productivity

Jerven Bolleman

Developer

UniProtKB/Swiss-Prot

© 2013 SIB

Thank you

UniProt.rdf

UniProt.rdf SPARQL

Data first

• Biocuration
  – Recover information ‘lost’ in papers
• Curation ≠ data entry
  – Extract knowledge from data
• Structuring knowledge
  – to integrate with related data
  – to answer further questions

Biocuration

• And the rock gets larger every day

MADNESS!

THIS IS Swiss-Prot!

63% more triples in a year

Make data retrieval worthwhile

• If your data is not easily accessible, then no one will query it.
• Simple would be nice, but:
  – you cannot make it simpler than your data
  – if the biology is difficult, so is your database
• After retrieval you must:
  – visualize
  – summarize

SPARQL? Give me a better pipette

Visualization is work

http://www.ebi.ac.uk/fgpt/gwas/

[Diagram: UniProt.rdf, SPARQL, CSV, SERVICE]

SPARQL or CLAY

Progression of query languages

Language        Standardized
SQL             1986-2011
XPath/XQuery    1999-2008
SPARQL          2008-2013

SQL is not standardized

• 7th ISO standard version
• Yet...
  – SHOW TABLES
  – SELECT table_name FROM user_tables
  – LIST TABLES
• Schemas are not fully transferable
  – VARCHAR2 or VARCHAR or CHAR or TEXT...
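As one more data point on the dialect problem, the sketch below uses Python's standard sqlite3 module. SQLite runs none of the three statements listed above; it exposes its catalogue through the sqlite_master table instead. The table name `protein` is made up for the example.

```python
import sqlite3

# Statements from the slide; each works in some other SQL product
# (SHOW TABLES in MySQL, user_tables in Oracle, LIST TABLES in DB2).
DIALECT_STATEMENTS = [
    "SHOW TABLES",
    "SELECT table_name FROM user_tables",
    "LIST TABLES",
]


def list_tables_sqlite(conn):
    """List user tables the SQLite way: query the sqlite_master catalogue."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def rejected_statements(conn):
    """Return the slide's statements that this SQLite connection refuses to run."""
    rejected = []
    for statement in DIALECT_STATEMENTS:
        try:
            conn.execute(statement)
        except sqlite3.Error:
            rejected.append(statement)
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT, name TEXT)")  # hypothetical table
tables = list_tables_sqlite(conn)
rejected = rejected_statements(conn)
```

Here `tables` holds the one table created above, and `rejected` shows which of the slide's statements SQLite refused.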

XPath/XQuery

• Fully standardized
  – Also in the marketplace
• Tree-based document query model
  – Assumes all data is in one document

SPARQL

• Fully standardized
  – Also in the marketplace
• Graph-based document query model
  – Assumes all data is reachable via the internet
  – Assumes nothing about the storage model

SPARQL against

• RDBMS
  – R2RML -> D2RQ, Ultrawrap, XSPARQL...
• Programs
  – SADI...
• Triplestore
  – MarkLogic, OWLIM, uRiKA, Oracle Spatial or NoSQL...
• Key-value
  – Redis
• Bioinformatics flat file formats
  – sparql-bed
• CSV/TSV/Spreadsheets
  – Tarql, Sparqlify


SPARQL against CSV

BED file:

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

SPARQL against CSV

• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)
• CSV is a relation between fields via headers

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

Columns 2 and 3: Start, End
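The mapping from a headed CSV row to subject, predicate, object statements can be sketched as follows. This is a minimal Python illustration, not the sparql-bed implementation: the header names, the `bed:` prefix and the row-based subject identifier are all assumptions made for the example.

```python
# Column headers for the first six BED fields (names chosen for this sketch).
BED_HEADERS = ["chrom", "start", "end", "name", "score", "strand"]


def row_to_triples(line, row_number, headers=BED_HEADERS):
    """Turn one whitespace-separated row into (subject, predicate, object) triples.

    The row becomes the subject, each header a predicate, each field an object.
    Fields beyond the known headers are ignored.
    """
    fields = line.split()
    subject = f"row:{row_number}"  # hypothetical identifier for the row
    return [
        (subject, f"bed:{header}", value)  # 'bed:' is a made-up prefix
        for header, value in zip(headers, fields)
    ]


bed_lines = [
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363",
    "chr7 127472363 127473530 Pos2 0 + 127472363 127473530",
]
triples = [
    triple
    for number, line in enumerate(bed_lines, start=1)
    for triple in row_to_triples(line, number)
]
```

Each row contributes one triple per header, so a query for a given predicate touches exactly one field of every row.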

SPARQL against CSV

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

Columns 2 and 3 as RDF: faldo:begin, faldo:end (each position a faldo:ExactPosition)

SPARQL against CSV

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

Columns 2 and 3 as SPARQL variables: ?start, ?end

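Sesame

@Override
public CloseableIteration<? extends Statement, QueryEvaluationException> getStatements(
        Resource subj, URI pred, Value obj, Resource... namedgraph)
        throws QueryEvaluationException {
    // Stub: matches nothing. A real implementation returns the statements
    // from the underlying file that fit the subject/predicate/object pattern.
    return new EmptyIteration<Statement, QueryEvaluationException>();
}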
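What a getStatements method has to do once it returns real data can be sketched in Python (chosen for brevity; this is not Sesame code). Each argument is either a concrete value or None as a wildcard, and the source yields every triple that matches. The triples and the `bed:` predicate names are invented for the example.

```python
def get_statements(triples, subj=None, pred=None, obj=None):
    """Yield triples matching the pattern; None matches anything."""
    for s, p, o in triples:
        if subj is not None and s != subj:
            continue
        if pred is not None and p != pred:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Invented triples standing in for two BED rows.
TRIPLES = [
    ("row:1", "bed:start", "127471196"),
    ("row:1", "bed:end", "127472363"),
    ("row:2", "bed:start", "127472363"),
    ("row:2", "bed:end", "127473530"),
]

starts = list(get_statements(TRIPLES, pred="bed:start"))
row1 = list(get_statements(TRIPLES, subj="row:1"))
nothing = list(get_statements(TRIPLES, obj="no-such-value"))
```

A query engine that only ever calls this one function needs to know nothing about how the data is stored, which is the point of the single-method interface.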

Big O compared to other approaches

• If the SPARQL engine:
  – detects query is per CSV “line”
    • O(number of lines)
  – else
    • O(number of lines * number of joins)
• Same as
  – cat | perl -ne
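The difference between the two cases can be made concrete by counting how many lines get read. The sketch below is a toy cost model in Python, not a SPARQL engine: the per-line case inspects each line once, while the naive join case rescans the file once per join. Function names and data are assumptions for illustration.

```python
def per_line_query(lines, predicate):
    """Single pass: every line is inspected exactly once."""
    reads = 0
    hits = []
    for line in lines:
        reads += 1
        if predicate(line):
            hits.append(line)
    return hits, reads


def naive_join_query(lines, join_predicates):
    """Naive evaluation: one full scan of the file per join predicate."""
    reads = 0
    surviving = set(lines)
    for predicate in join_predicates:
        matched = set()
        for line in lines:
            reads += 1
            if predicate(line):
                matched.add(line)
        surviving &= matched
    return sorted(surviving), reads


# Ten toy lines: chromosome, start, end.
lines = [f"chr7 {start} {start + 100}" for start in range(0, 1000, 100)]


def on_chr7(line):
    return line.startswith("chr7")


def starts_at_or_after_500(line):
    return int(line.split()[1]) >= 500


_, single_pass_reads = per_line_query(lines, on_chr7)
joined, join_reads = naive_join_query(lines, [on_chr7, starts_at_or_after_500])
```

With two join predicates the naive version reads twice as many lines as the single pass, matching the lines-times-joins bound above.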

• Strengths
  – Isolates data format from querying
  – Easy to put data on the web
    • (public SPARQL endpoints)
  – Single point of optimization
    • e.g. parallel query execution
  – Other programs can still access data
• Weaknesses
  – Time to code SPARQL to CSV translation
  – Latency
  – Harder to hack the code to see what is going on
    • (no pipe > to temporary file)

Doing this in Perl

wget ftp://ftp.ncbi...human_9606/VCF/00-All.vcf.gz
tabix 00-All.vcf.gz -B target_locations.bed | perl -ane '
  BEGIN { %patient = split /(\S+\n)/s, `cat target_locations.bed` }
  $alt_bases = $patient{"$F[0]\t$F[1]\t".($F[1]+length($F[3])-1)."\t"};
  chomp $alt_bases;
  print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'

Doing this in SPARQL

SELECT ?patientSnp ?dbSnp ?maf {
  ?patientSnp a ?mutationType ;
              faldo:begin ?patientBegin ;
              faldo:end ?patientEnd ;
              rdf:value ?patientValue .
  ?mutationType rdfs:subClassOf :mutation .
  SERVICE <ftp://ftp.ncbi.../human_9606/VCF/00-All.vcf.gz> {
    ?dbSnp a ?mutationType ;
           faldo:begin ?patientBegin ;
           faldo:end ?patientEnd ;
           rdf:value ?patientValue ;
           :MinorAlleleFrequency ?maf .
  }
}
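The join that both the Perl one-liner and the SPARQL query express can be sketched in plain Python: match patient variants to reference records on position and allele, then report the minor allele frequency. Everything here is illustrative: the records, the identifiers and the field names are made up, the mutation-type match is left out, and no real VCF parsing is attempted.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    begin: int
    end: int
    value: str  # alternative bases


# Made-up patient variants (in the slides these come from target_locations.bed).
patient = {
    "patient:1": Variant("7", 100, 100, "A"),
    "patient:2": Variant("7", 250, 251, "GT"),
}

# Made-up reference records, each with a minor allele frequency (MAF).
reference = {
    "ref:1": (Variant("7", 100, 100, "A"), 0.12),
    "ref:2": (Variant("7", 250, 251, "CC"), 0.30),  # same place, different allele
}


def join_on_variant(patient, reference):
    """Return (patient id, reference id, MAF) where position and allele agree."""
    index = {variant: (ref_id, maf) for ref_id, (variant, maf) in reference.items()}
    matches = []
    for patient_id, variant in patient.items():
        if variant in index:
            ref_id, maf = index[variant]
            matches.append((patient_id, ref_id, maf))
    return matches


matches = join_on_variant(patient, reference)
```

Only the first patient variant finds a partner, because the second differs from the reference record in its alternative bases.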

SELECT ?doi ?citatingDoi
WHERE {
  uniprot:P06280 up:annotation ?annotation ;
                 up:citation ?citation .
  ?citation dc:identifier ?doiRaw ;
            up:name "Nature" .
  ?annotation a up:Disease_Annotation .
  BIND (substr(?doiRaw, 5) AS ?doi)
  SERVICE <http://data.nature.com/sparql> {
    ?article prism:doi ?doi ;
             nature:hasCitation ?citationCitingCitation .
    ?citationCitingCitation prism:doi ?citatingDoi
  }
}
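One detail in the query is easy to misread: SPARQL's substr counts characters from 1, so substr(?doiRaw, 5) drops the first four characters. Assuming the raw identifier carries a `doi:` prefix (an assumption; the slide does not show the data), the Python equivalent is a slice starting at index 4. The DOI below is a placeholder, not a real citation.

```python
def sparql_substr(text, start):
    """Mimic SPARQL substr(text, start) for an integer start >= 1 (1-based)."""
    return text[start - 1:]


doi_raw = "doi:10.1000/example"  # placeholder value with the assumed prefix
doi = sparql_substr(doi_raw, 5)
```

The stripped value is what gets matched against prism:doi on the remote endpoint, so both sides must agree on the bare DOI form.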

Benefits of SERVICE

• In a world where data keeps growing
  – upload a 1KB query = cheap
  – download a 500GB dataset = expensive
• SPARQL viable via the web
  – 400GB of UniProt data can stay at UniProt
  – Your NGS data can stay in your data centre
• Easiest data compression is avoiding 100 copies ;)
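The cost argument is simple arithmetic. As a rough sketch, assuming a 100 Mbit/s link (an assumed figure, not from the talk) and ignoring latency and protocol overhead:

```python
LINK_BITS_PER_SECOND = 100e6  # assumed 100 Mbit/s connection


def transfer_seconds(size_bytes, bits_per_second=LINK_BITS_PER_SECOND):
    """Idealized transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / bits_per_second


query_seconds = transfer_seconds(1e3)      # 1 KB query
dataset_seconds = transfer_seconds(500e9)  # 500 GB dataset
dataset_hours = dataset_seconds / 3600
```

Under these assumptions the query takes well under a millisecond to send, while the dataset needs roughly 11 hours.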
