Fostering Serendipity through Big Linked Data
Semantic Web Challenge - Big Data track winner at ISWC2013
![Page 1: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/1.jpg)
Fostering Serendipity through Big Linked Data
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
![Page 2: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/2.jpg)
Agenda
• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work
![Page 3: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/3.jpg)
Motivation
Fostering Serendipity through Big Data Triplification, Continuous Integration,
and Visualization
![Page 4: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/4.jpg)
Triplification: Linked TCGA
• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data, with new data being submitted every day
• Goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)
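To make the triplification step concrete, here is a minimal sketch of how one tab-separated raw data file could be turned into N-Triples. It is not the authors' RDFizer: the namespace `http://example.org/tcga/`, the `result/<n>` subject scheme, and the sample row are assumptions made up for illustration. Only the column names (`bcr_patient_barcode`, `chromosome`, `start`, `stop`, `seg_mean`) are taken from the predicates named elsewhere in these slides.

```python
import csv
import io

# Hypothetical namespace; the real Linked TCGA URIs are not given in the slides.
BASE = "http://example.org/tcga/"


def row_to_ntriples(row_id, row):
    """Turn one parsed row (column name -> value) into N-Triples lines."""
    subject = f"<{BASE}result/{row_id}>"
    lines = []
    for column, value in row.items():
        predicate = f"<{BASE}schema/{column}>"
        # Escape backslashes and quotes so the literal stays valid.
        # (Simplification: newlines and other control characters are not handled.)
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} {predicate} "{escaped}" .')
    return lines


def triplify(tsv_text):
    """Convert a whole tab-separated file (header row first) into triples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row_id, row in enumerate(reader, start=1):
        triples.extend(row_to_ntriples(row_id, row))
    return triples


# Made-up sample row, for illustration only.
SAMPLE = (
    "bcr_patient_barcode\tchromosome\tstart\tstop\tseg_mean\n"
    "TCGA-XX-0001\t1\t100\t200\t0.5\n"
)

if __name__ == "__main__":
    for line in triplify(SAMPLE):
        print(line)
```

Each cell becomes one triple, so a file with five columns yields five triples per row. That multiplication is one reason the triple count grows so quickly relative to the number of raw files.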
![Page 5: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/5.jpg)
Triplification: PubMed
• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month
![Page 6: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/6.jpg)
Big Data Continuous Integration
[Figure: architecture diagram. Labelled components: TopFed (Parser, Federator, Optimizer, Integrator); Index; SPARQL Query, Sub-query, Results; TCGA Data Portal, PubMed, Entrez Utilities, RDFizer, Auto Loader; three SPARQL endpoints, each serving RDF.]
![Page 7: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/7.jpg)
For each query triple t(s, p, o) ∈ T

A = {chromosome, result, bcr_patient_barcode}
G = {start, stop}
B = {DNA-Methylation}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
M = {beta_value, position}

C-1 = {{p ∈ D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}
C-2 = {{p ∈ E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}
C-3 = {{p ∈ M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}

• C-1 category (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical), colour = blue
• C-2 category (Exon-Expression), colour = pink
• C-3 category (Methylation), colour = green

SPARQL endpoints and the tumours they hold
• Blue: b1 (1-16), b2 (17-33)
• Pink: p1 (1-5), p2 (6-11), p3 (12-16), p4 (17-22), p5 (23-27), p6 (28-33)
• Green: g1 (1-4), g2 (5-8), g3 (9-12), g4 (13-16), g5 (17-20), g6 (21-24), g7 (25-27), g8 (28-30), g9 (31-33)

If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every one
Highly Scalable
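The source selection on this slide can be sketched in a few lines of Python. This is an illustrative approximation, not TopFed's actual code. It checks only the first clause of each rule (predicate membership, or `rdf:type` with a matching object) and leaves out the S-Join and P-Join conditions, which the slide does not define. It also assumes that the tumour ranges on the slide map, in order, to the endpoints b1-b2, p1-p6 and g1-g9. The names `Pattern`, `candidate_categories` and `select_endpoints` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Set

# Predicate and class sets as listed on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
B = {"DNA-Methylation"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
M = {"beta_value", "position"}

# Assumed mapping: (endpoint, first tumour, last tumour) per category.
ENDPOINTS = {
    "C-1": [("b1", 1, 16), ("b2", 17, 33)],
    "C-2": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
            ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "C-3": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12),
            ("g4", 13, 16), ("g5", 17, 20), ("g6", 21, 24),
            ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}


class Pattern(NamedTuple):
    """A query triple t(s, p, o)."""
    s: str
    p: str
    o: str


def candidate_categories(t: Pattern) -> Set[str]:
    """Categories whose first clause matches t (join conditions omitted)."""
    is_type = t.p == "rdf:type"
    cats = set()
    if t.p in D | A | G or (is_type and t.o in C):
        cats.add("C-1")
    if t.p in E | A | G or (is_type and t.o in F):
        cats.add("C-2")
    if t.p in M | A or (is_type and t.o in B):
        cats.add("C-3")
    return cats


def select_endpoints(categories: Set[str], tumour: Optional[int] = None) -> List[str]:
    """Forward to the matching leaf if the tumour is known, else broadcast."""
    selected = []
    for cat in sorted(categories):
        for name, first, last in ENDPOINTS[cat]:
            if tumour is None or first <= tumour <= last:
                selected.append(name)
    return selected


if __name__ == "__main__":
    t = Pattern("?s", "RPKM", "?v")
    cats = candidate_categories(t)
    print(cats, select_endpoints(cats, tumour=7))
```

For the pattern with predicate `RPKM` and tumour 7, only category C-2 matches and the sub-query goes to the single endpoint `p2`. Without a tumour it would be broadcast to all six pink endpoints. A predicate from set A, such as `chromosome`, matches all three categories, which is where the omitted join conditions would be needed to narrow the choice.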
![Page 8: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/8.jpg)
Evaluation: Number of Sub-Query Submissions
• TopFed submits about 1/3 as many sub-queries as FedX
• Number of ASK requests
– FedX 480
– TopFed 10
[Chart: number of sub-queries submitted by FedX and TopFed for queries 1-10 and on average; y-axis from 0 to 60.]
![Page 9: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/9.jpg)
Evaluation: Query Runtime
[Chart: query execution time (msec, log scale) of FedX and TopFed for queries 1-10 and on average; y-axis from 10 to 100000.]
• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about 1/3 of that of FedX
• TopFed's best runtime (query 2, query 3) is more than 75 times smaller than that of FedX
![Page 10: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/10.jpg)
Big Data Track Requirements
• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed publications: 10k/month
![Page 11: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/11.jpg)
Big Data Visualization
![Page 12: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/12.jpg)
Tumor-wise Visualization
![Page 13: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/13.jpg)
PubMed Paper-wise Visualization
![Page 14: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/14.jpg)
Genome-wise Patients Results Visualization
![Page 15: Fostering Serendipity through Big Linked Data](https://reader034.vdocument.in/reader034/viewer/2022051514/54c290ad4a7959832a8b4603/html5/thumbnails/15.jpg)
Everything is Public
• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/
AKSW, University of Leipzig, Germany