TRIPLESTORE AND SPARQL

Lino Valdivia Jr

04.06.2013

OUTLINE

The Semantic Web

RDF

SPARQL

Triplestores

Apache Jena

DBPedia

Conclusions

Demo 1: Apache Jena API

Demo 2: DBPedia

THE SEMANTIC WEB

Most of the data on the web is unstructured or semi-structured:

HTML documents

Multimedia: images, video streams, audio files

It is meant to be read and processed by humans

What if we could structure this “Web of Documents” and add metadata to it, making it understandable by machines?

Metadata → meaning, or semantics

Machines can perform new tasks that used to require human intervention

This is the motivation behind the Semantic Web!

The term “Semantic Web” was coined by Tim Berners-Lee: “a web of data that can be processed directly and indirectly by machines.”

THE SEMANTIC WEB

“The Semantic Web is a web of data…[it] provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.”

[w3.org]

For the Semantic Web to happen, we would need

1. A way to structure and link data in a standardized way

2. A way to describe the relationships of these data in a common way

3. A way to query that linked data

4. A way to infer something from that linked data (by applying a set of rules)

but we will only focus on #1 and #3

RDF: A WAY TO STRUCTURE AND LINK DATA

RDF = Resource Description Framework, a standard way for applications to represent information that can then be shared and processed

A resource can be anything that is identifiable: a user, a coffee cup, a picture of your cat, a bank statement

RDF provides a way to model data by breaking it down into three components:

The subject

The object

The predicate (aka the property)

RDF AS A GRAPH

Consider the following statement: Jordi lives in Barcelona

Subject: Jordi

Object: Barcelona

Predicate: lives-in (or, to be more precise, address-city)

RDF statements are typically represented as a labeled directed graph:

[Figure: Jordi --address-city--> Barcelona]

The arrow points from the subject to the object
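The statement above can be sketched in plain Java as a subject-predicate-object triple. This is only an illustration: `RdfTripleSketch` and `Triple` are made-up names, not part of any RDF library.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; not part of Jena or any other RDF library.
final class RdfTripleSketch {
    // One RDF statement: an edge from subject to object, labeled by the predicate.
    record Triple(String subject, String predicate, String object) {}

    // An RDF graph can be seen as a set of such triples.
    static Set<Triple> exampleGraph() {
        Set<Triple> graph = new LinkedHashSet<>();
        graph.add(new Triple("Jordi", "address-city", "Barcelona"));
        // Adding the same statement again has no effect, because the graph is a set.
        graph.add(new Triple("Jordi", "address-city", "Barcelona"));
        return graph;
    }

    public static void main(String[] args) {
        for (Triple t : exampleGraph()) {
            System.out.println(t.subject() + " --" + t.predicate() + "--> " + t.object());
        }
    }
}
```

Modelling the graph as a set mirrors the idea that a statement is either in the graph or not; there is no notion of stating it twice.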

RDF AND URIS

Resources must be identifiable, and RDF uses Uniform Resource Identifier (URI) references.

E.g. Jordi = http://example.org/Jordi

URIs ≠ URLs! A URI identifies a resource, but it does not have to point to a retrievable document.

RDF graphs are typically shown with the URIs for the subject, object, and predicate:

[Figure: http://...Jordi --http://.../address-city--> http://.../Barcelona]

The RDF graph can also be rewritten in text as:

    <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .

As you may have guessed, RDF is more machine-friendly than human-friendly!

RDF: RESOURCES AND LITERALS

The object of a triple in RDF can be either a resource (identified by a URI) or a literal (a value such as a string or a number):

[Figure: http://...Jordi --http://...address-city--> http://.../Barcelona; http://...Jordi --http://...firstname--> “Jordi”; http://...Jordi --http://...age--> 37]

We can represent the RDF graph above as text:

    <http://example.org/Jordi> <http://example.org/address-city> <http://example.org/Barcelona> .
    <http://example.org/Jordi> <http://example.org/firstname> "Jordi" .
    <http://example.org/Jordi> <http://example.org/age> 37 .

This textual representation is also known as Terse RDF Triple Language, or Turtle for short.
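The difference between resource objects and literal objects shows up when triples are written out as text. The sketch below is a rough illustration using hypothetical types (`Resource`, `StringLiteral`, `IntegerLiteral`); it assumes the age is an integer literal, as drawn in the graph, and it is not how any particular RDF library serializes data.

```java
import java.util.List;

// Hypothetical illustration; real code would use an RDF library's writer.
final class TurtleObjectSketch {
    // The object of a triple is either a resource or a literal.
    interface Node { String toTurtle(); }

    record Resource(String uri) implements Node {
        public String toTurtle() { return "<" + uri + ">"; }
    }

    record StringLiteral(String value) implements Node {
        public String toTurtle() {
            // Escape backslashes and double quotes inside the string.
            return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        }
    }

    record IntegerLiteral(long value) implements Node {
        // Turtle allows integers to be written without quotes.
        public String toTurtle() { return Long.toString(value); }
    }

    record Triple(Resource subject, Resource predicate, Node object) {
        String toTurtle() {
            return subject.toTurtle() + " " + predicate.toTurtle() + " " + object.toTurtle() + " .";
        }
    }

    static List<String> exampleLines() {
        String ex = "http://example.org/";
        Resource jordi = new Resource(ex + "Jordi");
        return List.of(
            new Triple(jordi, new Resource(ex + "address-city"), new Resource(ex + "Barcelona")).toTurtle(),
            new Triple(jordi, new Resource(ex + "firstname"), new StringLiteral("Jordi")).toTurtle(),
            new Triple(jordi, new Resource(ex + "age"), new IntegerLiteral(37)).toTurtle());
    }

    public static void main(String[] args) {
        exampleLines().forEach(System.out::println);
    }
}
```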

RDF: PREFIXES

Prefixes can be used to simplify representations, either in graphs:

prefix ex: http://example.org/

or in Turtle:

    @prefix ex: <http://example.org/> .
    ex:Jordi ex:address-city ex:Barcelona .
    ex:Jordi ex:firstname "Jordi" .
    ex:Jordi ex:age 37 .
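A prefixed name is just shorthand: the prefix is replaced by its namespace URI and the local part is appended. A minimal sketch of that expansion, with a hypothetical helper named `expand`:

```java
import java.util.Map;

// Hypothetical helper; RDF libraries ship their own prefix handling.
final class PrefixSketch {
    // Expands a name such as "ex:Jordi" into a full URI using the prefix table.
    static String expand(String prefixedName, Map<String, String> prefixes) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + prefixedName);
        }
        String prefix = prefixedName.substring(0, colon);
        String localName = prefixedName.substring(colon + 1);
        String namespace = prefixes.get(prefix);
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        // Plain concatenation: the namespace must end with "/" (or "#") itself.
        return namespace + localName;
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = Map.of("ex", "http://example.org/");
        System.out.println(expand("ex:Jordi", prefixes));
        System.out.println(expand("ex:address-city", prefixes));
    }
}
```

Because expansion is plain concatenation, a namespace declared without its trailing slash would yield `http://example.orgJordi`, which is a different URI.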

Now that we have a way to structure and link our data, we want to be able to query it for information.

[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> “Jordi”; ex:Jordi --ex:age--> 37]

SPARQL: A WAY TO QUERY LINKED DATA

SPARQL = SPARQL Protocol and RDF Query Language

SPARQL 1.1 became a W3C Recommendation in March 2013!

Example: given our RDF graph, show the first names of all users who live in Barcelona:

    PREFIX ex: <http://example.org/>
    SELECT ?fname
    FROM <users.rdf>
    WHERE {
      ?user ex:address-city ex:Barcelona .
      ?user ex:firstname ?fname .
    }

SPARQL AND GRAPH PATTERNS

The statements in the WHERE clause form a graph pattern, which is matched against subgraphs in the RDF graph to form the solution.

    PREFIX ex: <http://example.org/>
    SELECT ?fname
    FROM <users.rdf>
    WHERE {
      ?user ex:address-city ex:Barcelona .
      ?user ex:firstname ?fname .
    }

[Figure: RDF graph. ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> “Jordi”; ex:Jordi --ex:age--> 37; ex:Josep --ex:address-city--> ex:Badalona]
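To make the matching idea concrete, here is a naive sketch of how a graph pattern could be evaluated against a small in-memory graph. It is an illustration only: the class and method names are hypothetical, terms are plain strings, and real SPARQL engines use indexes and query planning rather than nested loops.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, deliberately naive matcher; not how a real SPARQL engine works.
final class GraphPatternSketch {
    record Triple(String s, String p, String o) {}

    // The example graph from the slide, with literals stored as plain strings.
    static final List<Triple> GRAPH = List.of(
        new Triple("ex:Jordi", "ex:address-city", "ex:Barcelona"),
        new Triple("ex:Jordi", "ex:firstname", "Jordi"),
        new Triple("ex:Jordi", "ex:age", "37"),
        new Triple("ex:Josep", "ex:address-city", "ex:Badalona"));

    // Unifies one pattern term with one graph term under the current bindings.
    static boolean bind(String term, String value, Map<String, String> bindings) {
        if (!term.startsWith("?")) {
            return term.equals(value); // constants must match exactly
        }
        String previous = bindings.putIfAbsent(term, value);
        return previous == null || previous.equals(value); // variables must stay consistent
    }

    // Extends the set of solutions one triple pattern at a time.
    static List<Map<String, String>> match(List<Triple> patterns, List<Triple> graph) {
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>()); // start from a single empty solution
        for (Triple pattern : patterns) {
            List<Map<String, String>> extended = new ArrayList<>();
            for (Map<String, String> solution : solutions) {
                for (Triple t : graph) {
                    Map<String, String> candidate = new HashMap<>(solution);
                    if (bind(pattern.s(), t.s(), candidate)
                            && bind(pattern.p(), t.p(), candidate)
                            && bind(pattern.o(), t.o(), candidate)) {
                        extended.add(candidate);
                    }
                }
            }
            solutions = extended;
        }
        return solutions;
    }

    // Equivalent of the WHERE clause above, projecting ?fname.
    static List<String> firstNamesInBarcelona() {
        List<Triple> where = List.of(
            new Triple("?user", "ex:address-city", "ex:Barcelona"),
            new Triple("?user", "ex:firstname", "?fname"));
        List<String> names = new ArrayList<>();
        for (Map<String, String> solution : match(where, GRAPH)) {
            names.add(solution.get("?fname"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(firstNamesInBarcelona());
    }
}
```

Only ex:Jordi satisfies both patterns with the same binding for ?user, so ex:Josep (who lives in Badalona) does not appear in the result.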

SPARQL: THE SELECT OPERATION

SPARQL SELECT operations also support:

FILTERs, ORDER BYs, LIMITs, and OFFSETs:

Show the names of users who live in Barcelona and are less than 40 years old, from the 11th to the 40th user:

    PREFIX ex: <http://example.org/>
    SELECT ?lname ?fname
    FROM <users.rdf>
    WHERE {
      ?user ex:address-city ex:Barcelona .
      ?user ex:firstname ?fname .
      ?user ex:lastname ?lname .
      ?user ex:age ?age .
      FILTER (?age < 40)
    }
    ORDER BY ?lname
    LIMIT 30
    OFFSET 10


Alternative matches using UNION, for those cases where resources in the expected result set may match multiple patterns:

Show the first names of users who live in Barcelona or in Badalona:

    PREFIX ex: <http://example.org/>
    SELECT ?fname
    FROM <users.rdf>
    WHERE {
      ?user ex:firstname ?fname .
      {
        { ?user ex:address-city ex:Barcelona . }
        UNION
        { ?user ex:address-city ex:Badalona . }
      }
    }


OPTIONAL matches, for those cases where not every resource in the expected result set has to match a pattern:

Show the first names of users who live in Barcelona and their profile pic image, if they have one:

    PREFIX ex: <http://example.org/>
    SELECT ?fname ?ppic
    FROM <users.rdf>
    WHERE {
      ?user ex:address-city ex:Barcelona .
      ?user ex:firstname ?fname .
      OPTIONAL { ?user ex:ppic ?ppic . }
    }


Set inclusion (IN/NOT IN)

GROUP BY, HAVING, and aggregate functions such as COUNT and AVG (new in SPARQL 1.1)

Subqueries (new in SPARQL 1.1)
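For readers more at home in Java than in SPARQL, the FILTER, ORDER BY, OFFSET/LIMIT and OPTIONAL behaviour listed above can be loosely mimicked with streams over an in-memory list. This is an analogy rather than an implementation of SPARQL semantics; the `User` record and its sample data are invented for the illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for query solutions; not a SPARQL engine.
final class SelectModifiersSketch {
    // ppic is empty when a user has no profile picture (the OPTIONAL case).
    record User(String lname, String fname, int age, Optional<String> ppic) {}

    // Invented sample data.
    static final List<User> USERS = List.of(
        new User("Vila", "Anna", 29, Optional.of("anna.png")),
        new User("Puig", "Marc", 45, Optional.empty()),
        new User("Amat", "Pau", 33, Optional.empty()),
        new User("Soler", "Laia", 38, Optional.of("laia.png")),
        new User("Bosch", "Nil", 19, Optional.empty()));

    // FILTER (?age < maxAge), ORDER BY ?lname, then OFFSET and LIMIT.
    static List<String> page(int maxAge, int offset, int limit) {
        return USERS.stream()
            .filter(u -> u.age() < maxAge)
            .sorted(Comparator.comparing(User::lname))
            .skip(offset)   // OFFSET drops the first rows of the ordered result
            .limit(limit)   // LIMIT caps how many rows are returned after that
            .map(u -> u.lname() + ", " + u.fname())
            .collect(Collectors.toList());
    }

    // OPTIONAL keeps every user; the optional variable is simply unbound when absent.
    static List<String> namesWithOptionalPic() {
        return USERS.stream()
            .map(u -> u.fname() + " -> " + u.ppic().orElse("(unbound)"))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(page(40, 1, 2));
        System.out.println(namesWithOptionalPic());
    }
}
```

With OFFSET 10 and LIMIT 30, the same slicing yields rows 11 through 40 of the ordered result, which is what the FILTER example asks for.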

SPARQL: OTHER OPERATIONS

Aside from SELECTs for querying, SPARQL also has

CONSTRUCT – creates a single RDF graph from the result of a query by combining (i.e. applying set union on) the resulting triples

ASK – returns a Boolean that indicates whether the query pattern has at least one match

DESCRIBE – returns an RDF graph that describes the result (as determined by the query service)

INSERT/DELETE – adds or removes triples from the graph (new in SPARQL 1.1)

Graph management operations (CREATE, DROP, COPY, MOVE, ADD) (new in SPARQL 1.1)

TRIPLESTORES

The statements in an RDF graph (subject-predicate-object) are also known as triples, and the specialized databases used for storing them are called triplestores.

Triplestores vs Graph Databases – What’s the diff?

Triplestores are especially designed to store RDF graphs, which are labeled directed graphs

On the other hand, graph databases can store any kind of graph (unlabeled, undirected, weighted, etc.)

Graph databases don’t have a standard query language (Cypher?)

Triplestores must support SPARQL

Triplestores are optimized for graph pattern matching, and may lack the full capabilities of graph DBs

But graph databases can be used to implement a triplestore (see Sequeda, J. (2013, January 31) Introduction to Triplestores)

SPARQL AND CYPHER

SPARQL:

    PREFIX ex: <http://example.org/>
    SELECT ?fname
    FROM <users.rdf>
    WHERE {
      ?user ex:address-city ex:Barcelona .
      ?user ex:firstname ?fname .
    }

Cypher:

    MATCH (user)-[:ex_firstname]->(fname), (user)-[:`ex_address-city`]->(city)
    WHERE city.uri = "ex:Barcelona"
    RETURN fname

[Figure: ex:Jordi --ex:address-city--> ex:Barcelona; ex:Jordi --ex:firstname--> “Jordi”; ex:Jordi --ex:age--> 37]

TRIPLESTORE IMPLEMENTATIONS

Native triplestores: Sesame, BigData, Meronymy, Apache Jena TDB

Graph DB-based: AllegroGraph, Oracle Spatial and Graph (formerly Oracle Semantic Technologies)

Relational DB-based: Apache Jena SDB, IBM DB2

APACHE JENA

Born in HP Labs in 2000, became a top-level Apache project in April 2012

The Jena Framework includes

A Java API for working with RDF models

A SPARQL query processor

An efficient disk-based native triplestore

A rule-based inference engine that can be used with RDF-based ontologies

A server for accepting SPARQL queries over HTTP (a SPARQL endpoint)

APACHE JENA: RDF API

The Statement interface represents triples, while the Model interface represents the whole RDF graph

Given a Statement, one could invoke

getSubject(), which would return a Resource

getPredicate(), which would return a Property

getObject(), which would return an RDFNode (which can be a Resource or a Literal)

To create our example basic RDF graph:

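    // Classes live in org.apache.jena.rdf.model (com.hp.hpl.jena.rdf.model in Jena 2.x).
    Model model = ModelFactory.createDefaultModel();
    Resource j = model.createResource("http://example.org/Jordi");
    Resource bcn = model.createResource("http://example.org/Barcelona");
    // createProperty takes a namespace URI and a local name, not a prefix label.
    Property addrCity = model.createProperty("http://example.org/", "address-city");

    // This automatically creates a Statement in the associated model.
    j.addProperty(addrCity, bcn);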

APACHE JENA: ARQ API

Jena also provides an API called ARQ for programmatically executing SPARQL queries.

To execute a given query on our example graph:

    String queryString = "...";
    Query query = QueryFactory.create(queryString);
    // Associate a query execution context against our model.
    QueryExecution qe = QueryExecutionFactory.create(query, model);

    ResultSet rs = qe.execSelect();

    // ResultSet acts like an Iterator.
    while (rs.hasNext()) {
        QuerySolution qs = rs.nextSolution();
        RDFNode r = qs.get("fname"); // You can get a variable by name.
        // Do what you want with it.
    }

    // Always good to close resources when done.
    qe.close();

APACHE JENA: TDB

Jena’s native triplestore implementation is called TDB and consists of

The node table: stores resources, predicates (relationships), and literals; maps nodes to internal node ids, and vice versa; node ids are 8 bytes (64 bits) long

The triple indexes: stores 3 indexes into the node table

The prefixes table: maps prefixes to URIs
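The split between a node table and triple indexes can be pictured with a small in-memory sketch. This is not TDB's actual on-disk design: the class below is hypothetical, and the choice of sorted sets in SPO, POS and OSP order is an assumption made for the illustration. It only shows why storing fixed-size ids instead of full URIs, in several orderings, makes pattern lookups cheap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory sketch; not TDB's real storage layout.
final class NodeTableSketch {
    // Node table: node <-> 64-bit id, in both directions.
    private final Map<String, Long> idsByNode = new HashMap<>();
    private final List<String> nodesById = new ArrayList<>();

    // An index entry holds three node ids in the order the index sorts by.
    record Key(long a, long b, long c) {}

    static final Comparator<Key> ORDER = Comparator.comparingLong(Key::a)
        .thenComparingLong(Key::b)
        .thenComparingLong(Key::c);

    // Three indexes over the same triples, each sorted by a different leading term.
    private final TreeSet<Key> spo = new TreeSet<>(ORDER);
    private final TreeSet<Key> pos = new TreeSet<>(ORDER);
    private final TreeSet<Key> osp = new TreeSet<>(ORDER);

    // Returns the id of a node, assigning a new one on first sight.
    long idOf(String node) {
        Long existing = idsByNode.get(node);
        if (existing != null) {
            return existing;
        }
        long id = nodesById.size();
        nodesById.add(node);
        idsByNode.put(node, id);
        return id;
    }

    String nodeOf(long id) {
        return nodesById.get((int) id);
    }

    void add(String s, String p, String o) {
        long si = idOf(s), pi = idOf(p), oi = idOf(o);
        spo.add(new Key(si, pi, oi));
        pos.add(new Key(pi, oi, si));
        osp.add(new Key(oi, si, pi));
    }

    // Answers the pattern "?s p o" with a range scan on the POS index.
    List<String> subjectsWith(String p, String o) {
        List<String> result = new ArrayList<>();
        Long pi = idsByNode.get(p), oi = idsByNode.get(o);
        if (pi == null || oi == null) {
            return result; // unknown node, so nothing can match
        }
        Key from = new Key(pi, oi, Long.MIN_VALUE);
        Key to = new Key(pi, oi, Long.MAX_VALUE);
        for (Key k : pos.subSet(from, true, to, true)) {
            result.add(nodeOf(k.c()));
        }
        return result;
    }

    static List<String> demo() {
        NodeTableSketch store = new NodeTableSketch();
        store.add("ex:Jordi", "ex:address-city", "ex:Barcelona");
        store.add("ex:Jordi", "ex:firstname", "Jordi");
        store.add("ex:Josep", "ex:address-city", "ex:Badalona");
        return store.subjectsWith("ex:address-city", "ex:Barcelona");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```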

TDB also supports ACID transactions using write-ahead logging.

But no transaction is needed if there is only a single writer (even with multiple concurrent readers)

RDF/SPARQL IN ACTION: DBPEDIA.ORG

DBPedia describes itself as a “crowdsourced community effort to extract structured information from Wikipedia”

1.89 billion triples localized in 111 languages

English dataset contains 3.77 million topics

Imagine if you could ask Wikipedia…

Which towns in Cataluña have a population between 10,000 and 50,000 people?

What are the birthdays of all blues guitarists who were born in Chicago? (sample query from DBPedia.org wiki)

Show me all soccer players who played as goalkeeper for a club that has a stadium with more than 40,000 seats and who were born in a country with more than 10 million inhabitants

DBPedia also provides a SPARQL endpoint, so other websites can query its data and get results that are continuously updated

DBPedia also contains geo-coordinates obtained from other sources (e.g. Geonames, Eurostat, CIA World Fact Book) – this opens the possibility for location-based applications from mobile devices
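Querying a SPARQL endpoint from another application comes down to an HTTP request that carries the query text. The sketch below only builds the request URI and sends nothing over the network. The endpoint address `http://dbpedia.org/sparql` is an assumption to verify against the DBPedia site, and the class name and the sample query are made up for the illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it only constructs the request URI and performs no I/O.
final class EndpointRequestSketch {
    // Assumed endpoint location; check the DBPedia site for the current address.
    static final String ENDPOINT = "http://dbpedia.org/sparql";

    // The SPARQL Protocol passes the query text in the "query" parameter of a GET request.
    static URI buildRequest(String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(ENDPOINT + "?query=" + encoded);
    }

    // A deliberately generic query that assumes nothing about DBPedia's vocabulary.
    static URI exampleRequest() {
        return buildRequest("SELECT * WHERE { ?s ?p ?o } LIMIT 5");
    }

    public static void main(String[] args) {
        System.out.println(exampleRequest());
    }
}
```

The resulting URI could then be fetched with any HTTP client; the result format is typically negotiated through the Accept header.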

CONCLUSIONS

The Semantic Web – Web 3.0?

RDF and SPARQL are key technologies in the W3C’s vision of the web of tomorrow

Companies like Google, Tesco, and Best Buy already produce RDF content!

Add some SPARQL to your projects!

Source: w3.org

BIBLIOGRAPHY

Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. http://www.scientificamerican.com/article.cfm?id=the-semantic-web

W3 Consortium. (2004, February 10). RDF Primer. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/

W3 Consortium. (2013, March 21). SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/

Sequeda, J. (2013, January 31). Introduction to Triplestores. http://semanticweb.com/introduction-to-triplestores_b34996

Apache Jena http://jena.apache.org/

DBPedia http://dbpedia.org/
