sparql,uniprot.org in production

17
The UniProt SPARQL endpoint: in production

Upload: jerven-bolleman

Post on 13-Apr-2017

807 views

Category:

Science


3 download

TRANSCRIPT

Page 1: sparql,uniprot.org in production

The UniProt SPARQL endpoint: in production

Page 2: sparql,uniprot.org in production

© 2015 SIB

Why provide a public SPARQL endpoint

• A 10 man wet laboratory can not afford: – to host their own database houses holding all or even a bit of all

life science data. – not to have access, and use, existing life science information.

• Classical SQL can be provided on the web – Is not practical – No federation – No standards adherence

• Document centric REST is not enough – Swiss-Prot available as REST (over e-mail !!) since

1986 – www.uniprot.org since 2002

2

Page 3: sparql,uniprot.org in production

© 2015 SIB3

Page 4: sparql,uniprot.org in production

© 2015 SIB4

Page 6: sparql,uniprot.org in production

© 2015 SIB

14,821,380,921

6

Node 1

64 cpu cores 256 GB ram 2.5 TB consumer SSD

Load Balancer = Apache mod_balancer

Node 2

64 cpu cores 256 GB ram 2.5 TB consumer SSD

Page 7: sparql,uniprot.org in production

© 2015 SIB7

Node 1 Node 2

Tomcat + Sesame + UI

Virtuoso 7.2 (+)

Tomcat + Sesame + UI

Virtuoso 7.2 (+)

Load Balancer = Apache mod_balancer

14,821,380,921

Page 8: sparql,uniprot.org in production

15,288,484,658

© 2015 SIB8

Node 1 Node 2

Tomcat + Sesame + UI

Virtuoso 7.2 (+)

Tomcat + Sesame + UI

Virtuoso 7.2 (+)

Load Balancer = Apache mod_balancer

Load Balancer = Apache mod_balancer

2 independant datacentres

Page 9: sparql,uniprot.org in production

© 2015 SIB

Dedicated machine for loading and testing

• Loading RDF data “solved” problem – 500,000 triples per second easy

• that’s what our machine plus virtuoso 7.2 – and some tricks does

– 1,000,000 possible (xz unzip limit on our machine) • nquads or rdf/xml

– higher values needs parallel readers • or even lighter weight parsers

– highest observed rate • 2.5 million per second on 1/4 exadata

– could be pushed higher

9

Page 10: sparql,uniprot.org in production

© 2015 SIB

Openlink SW: Virtuoso 7.2

• very responsive to issues • performance is good and getting better

• not quite fully SPARQL 1.1 yet (corner cases) • anytime query

– do not accept the default settings – spend some time securing your setup

– (jails,cgroups, etc…)

10

Page 11: sparql,uniprot.org in production

© 2015 SIB

Challenges as a public endpoint provider

• Query load unpredictable

• Simple data discovery queries are hard – 1 TB+ of DB files – e.g. from monitoring services

• Query timeouts not sufficient – aim for 100% utilisation – what can http reasonably support – we want to be able to answer hard questions

11

Page 12: sparql,uniprot.org in production

© 2015 SIB12

Entries in UniProtKB over time

Growing by 400 million triples a month

Page 13: sparql,uniprot.org in production

100

10'000

1'000'000

2015-01

2015-02

2015-03

2015-04

2015-05

2015-06

2015-07

2015-08

queriesaskselectconstructdescribe

© 2015 SIB

Queries per month in 2015 peak: 4 million per month

13

Page 14: sparql,uniprot.org in production

© 2015 SIB

Real users

Mix between hard analytics and super specific Estimate somewhere between: 300 - 1000 real humans per month

We know they are real because they take holidays ;)

14

Page 15: sparql,uniprot.org in production

© 2015 SIB15

Page 16: sparql,uniprot.org in production

• Public monitoring also hard – often lower uptime than what is being monitored – robots.txt – not enough community support – service description

• not being parsed – HEAD last modified?

© 2015 SIB

Public monitoring key aid in quality assurance

16

✔sparqles

Page 17: sparql,uniprot.org in production

© 2015 SIB

Key-Value orientated SPARQL endpoint anyone?• assume 400 million

named graphs – average 50 triples

• max 5000 triples – get the whole

named graph • single IO

operation

17

CONSTRUCT {} FROM uniprot:P05067 WHERE {}