Scaling out federated queries for Life Sciences Data In Production
![Page 1: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/1.jpg)
SCALING OUT FEDERATED QUERIES
FOR LIFE SCIENCES DATA IN PRODUCTION
Dieter De Witte, Laurens De Vocht, et al.
• IMEC – IDLAB – GHENT UNIVERSITY
• ONTOFORCE
![Page 2: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/2.jpg)
Catch 22!?
A. No Semantic Web Applications
because no Semantic Data
B. No Semantic Data
because no applications
![Page 3: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/3.jpg)
A. The LOD Cloud for Life Sciences...
Ontoforce’s
DISQOVER
covers
> 110
Life Sciences Datasets
![Page 4: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/4.jpg)
B. DISQOVER is an Exploratory Semantic Search UI (faceted browsing)
To Click = To SPARQL
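A minimal sketch of what "To Click = To SPARQL" can mean for faceted browsing: each clicked facet value becomes one more triple pattern, and the UI asks for counts per value of the next facet. This is not DISQOVER's actual query generator; the function name `facet_count_query` and all IRIs under `example.org` are hypothetical.

```python
from typing import Dict, List


def facet_count_query(type_iri: str, selected: Dict[str, str], facet_predicate: str) -> str:
    """Build a SPARQL query that counts entities per value of one facet,
    restricted by the facet values the user has already clicked.

    `selected` maps predicate IRI -> chosen value IRI (hypothetical model).
    """
    patterns: List[str] = [f"?entity a <{type_iri}> ."]
    # One triple pattern per facet value that was clicked so far.
    for predicate, value in sorted(selected.items()):
        patterns.append(f"?entity <{predicate}> <{value}> .")
    # The facet whose values (and counts) should be shown next.
    patterns.append(f"?entity <{facet_predicate}> ?facetValue .")
    body = "\n  ".join(patterns)
    return (
        "SELECT ?facetValue (COUNT(DISTINCT ?entity) AS ?count)\n"
        "WHERE {\n  " + body + "\n}\n"
        "GROUP BY ?facetValue\n"
        "ORDER BY DESC(?count)"
    )


# Example: the user clicked one facet value and now opens another facet.
query = facet_count_query(
    "http://example.org/ClinicalTrial",
    {"http://example.org/phase": "http://example.org/Phase3"},
    "http://example.org/condition",
)
print(query)
```

Even this small sketch already produces the COUNT DISTINCT, GROUP BY and ORDER BY constructs that show up in the benchmark queries.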
![Page 5: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/5.jpg)
The missing link in our catch 22? “How to run federated queries?”
Direct ETL
![Page 6: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/6.jpg)
• Cloud Instances
• PAGO AMIs:
Scientific Benchmark = Reproducible Benchmark
Benchmark Client
• 1 single-threaded warm-up run (all 1,223 queries)
• 1 multi-threaded (8) run
• (8 x randomized order)
Database Node(s)
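The client protocol above can be sketched as follows, assuming "8 x randomized order" means that each of the 8 threads runs the full query set in its own shuffled order. The names `run_pass` and `run_benchmark` and the `execute` callable (which would send one query to the database node) are hypothetical, not the authors' benchmark software.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Timing = Tuple[int, float, bool]  # (query index, elapsed seconds, succeeded)


def run_pass(queries: List[str], execute: Callable[[str], object],
             seed: Optional[int]) -> List[Timing]:
    """Run every query once; shuffle the order when a seed is given."""
    order = list(range(len(queries)))
    if seed is not None:
        random.Random(seed).shuffle(order)  # seeded, so the order is reproducible
    timings: List[Timing] = []
    for idx in order:
        start = time.perf_counter()
        try:
            execute(queries[idx])
            ok = True
        except Exception:  # timeouts and errors are recorded, not fatal
            ok = False
        timings.append((idx, time.perf_counter() - start, ok))
    return timings


def run_benchmark(queries: List[str], execute: Callable[[str], object],
                  threads: int = 8) -> Tuple[List[Timing], List[List[Timing]]]:
    """One single-threaded warm-up pass, then one pass per thread in parallel."""
    warmup = run_pass(queries, execute, seed=None)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(run_pass, queries, execute, seed)
                   for seed in range(threads)]
        stress = [f.result() for f in futures]
    return warmup, stress


# Usage with a stand-in executor; a real one would issue an HTTP request.
demo_queries = [f"query {i}" for i in range(20)]
warmup, stress = run_benchmark(demo_queries, lambda q: None)
```

Seeding each thread's shuffle keeps the run reproducible, in line with "Scientific Benchmark = Reproducible Benchmark".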
![Page 7: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/7.jpg)
How to evaluate an RDF Database solution? Performance (
Data store,
Dataset,
Configuration,
Number of nodes,
Hardware (RAM)
)
![Page 8: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/8.jpg)
Performance (
NoSQL Triple stores,
WatDiv 10M, 100M, 1000M,
Standard Configs,
Single Node,
32 GB RAM
)
SIGMOD 2016: Single Node SOTA on artificial data
![Page 9: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/9.jpg)
More data, more problems
Timeout
![Page 10: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/10.jpg)
Query performance: Virtuoso Leads, Blazegraph follows
Timeout
![Page 11: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/11.jpg)
SWAT4LS 2016: Multi-node SOTA on real data
Performance (
Scale-out systems,
DISQOVER data,
Optimized Configs,
Multi-Node, Compression,
64 GB RAM
)
![Page 12: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/12.jpg)
How to deal with Big Linked Data?
1. Vertical Scaling: bigger box
2. Compression: smaller content
3. Horizontal Scaling: more boxes, 1 location
4. Federation: more boxes, more locations
V1, Bla1 (single node Virtuoso, Blazegraph)
V1_32 (32GB Virtuoso)
Fu1 (Fuseki + HDT)
V3 (Virtuoso cluster 3 nodes)
Fl3 (FluidOps, aka FedX)
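The setup labels above can be captured as a small lookup table, which makes the later result slides easier to read. This is only a sketch: the 64 GB RAM default is taken from the SWAT4LS setup slide, while the node count of 3 for Fl3 and the pairing of each label with a scaling strategy are assumptions read off the label suffixes and the list order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Setup:
    label: str     # short name used on the slides
    engine: str
    nodes: int     # assumed from the numeric suffix of the label
    ram_gb: int    # assumed 64 GB unless the label says otherwise
    strategy: str  # assumed pairing with the four strategies listed above


SETUPS = [
    Setup("V1", "Virtuoso", 1, 64, "single node"),
    Setup("Bla1", "Blazegraph", 1, 64, "single node"),
    # Smaller box; compared against V1 to study vertical scaling.
    Setup("V1_32", "Virtuoso", 1, 32, "vertical scaling"),
    Setup("Fu1", "Fuseki + HDT", 1, 64, "compression"),
    Setup("V3", "Virtuoso cluster", 3, 64, "horizontal scaling"),
    Setup("Fl3", "FluidOps (FedX)", 3, 64, "federation"),
]


def labels_for(strategy: str) -> List[str]:
    """Return the slide labels of all setups that use the given strategy."""
    return [s.label for s in SETUPS if s.strategy == strategy]


print(labels_for("single node"))
```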
![Page 13: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/13.jpg)
DISQOVER dataset ...
and queries
![Page 14: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/14.jpg)
Count, Union, Sort, Aggregations
![Page 15: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/15.jpg)
Example Query 1: Nesting, FILTERs, unbound triples
![Page 16: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/16.jpg)
Example Query 4: Aggregations, Optionals
![Page 17: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/17.jpg)
Initial performance results were counter-intuitive... and incorrect!!!
Worse hardware, better performance?
![Page 18: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/18.jpg)
Only Virtuoso-backed systems survive multi-threaded benchmark
marks last successful query (no timeout)
Panels: 1 x 1,223 queries (single-threaded) | 8 x 1,223 queries (multi-threaded)
![Page 19: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/19.jpg)
No errors but incorrect #results!!!
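One way such silent failures could be caught is to compare the number of results each system returns for the same query. The sketch below flags systems that deviate from the most common count; it is a hypothetical check, not the authors' tooling, and agreement among a majority is only a heuristic, not proof of correctness. The function name and the sample counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def suspicious_result_counts(counts: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """counts[query_id][setup] = number of rows returned.

    Returns, per query, the setups whose row count differs from the
    most common count. If no single count is most common, every setup
    for that query is flagged.
    """
    flagged: Dict[str, List[str]] = {}
    for query_id, per_setup in counts.items():
        tally = Counter(per_setup.values())
        top = tally.most_common(2)
        if len(top) < 2:
            continue  # zero or one distinct count: nothing to compare
        if top[0][1] == top[1][1]:
            flagged[query_id] = sorted(per_setup)  # tie: no consensus
        else:
            consensus = top[0][0]
            flagged[query_id] = sorted(
                s for s, n in per_setup.items() if n != consensus
            )
    return flagged


# Illustrative numbers only, not measured results.
sample = {
    "q1": {"V1": 10, "V3": 10, "Fl3": 7},
    "q2": {"V1": 5, "V3": 5, "Fl3": 5},
    "q3": {"V1": 1, "Fl3": 2},
}
report = suspicious_result_counts(sample)
```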
![Page 20: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/20.jpg)
FILTERs, UNIONs are challenging but ORDER + GROUP + OPTIONAL dominate
COUNT DISTINCT
600 – 1,223 BGPs
![Page 21: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/21.jpg)
Conclusions & Future Work
• Additional diagnostics for RDF solutions!
• Extend benchmarking software with query correctness assessment!
• Multi-node RDF solutions???
• Towards a full paper:
– NoSQL for Ontoforce Data
– Scale-out approaches for WatDiv + test LDF (Linked Data Fragments)
– Release reusable end-to-end benchmark software:
• Setup AND Postprocessing
![Page 22: Scaling out federated queries for Life Sciences Data In Production](https://reader031.vdocument.in/reader031/viewer/2022022414/5871bd9f1a28ab55058b5f4d/html5/thumbnails/22.jpg)
Thanks for your attention!!
contact: [email protected]