GDG Meets U event - Big Data & Wikidata - no lies codelab

Posted on 27-Jan-2015


DESCRIPTION

RDF triple information on Wikipedia; making SPARQL queries on the DBpedia endpoint.

TRANSCRIPT

BigData & Wikidata - no lies

SPARQL queries on DBpedia

Camelia Boban


Resources for the codelab:

Eclipse Luna for J2EE developers - https://www.eclipse.org/downloads/index-developer.php

Java SE 1.8 - http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html

Apache Tomcat 8.0.5 - http://tomcat.apache.org/download-80.cgi

Axis2 1.6.2 - http://axis.apache.org/axis2/java/core/download.cgi

Apache Jena 2.11.1 - http://jena.apache.org/download/

DBpedia SPARQL endpoint - http://dbpedia.org/sparql


JARs needed:

httpclient-4.2.3.jar
httpcore-4.2.2.jar
jena-arq-2.11.1.jar
jena-core-2.11.1.jar
jena-iri-1.0.1.jar
jena-sdb-1.4.1.jar
jena-tdb-1.0.1.jar
slf4j-api-1.6.4.jar
slf4j-log4j12-1.6.4.jar
xercesImpl-2.11.0.jar
xml-apis-1.4.01.jar

Attention!!

NO jcl-over-slf4j-1.6.4.jar (conflicts with slf4j-log4j12-1.6.4: "Can't override final class" exception)

NO httpcore-4.0.jar (pulled in by Axis2; conflicts with httpcore-4.2.2.jar and prevents the web service from being created)


The Semantic Web

The Semantic Web is a project that intends to add computer-processable meaning (semantics) to the World Wide Web.

SPARQL

A protocol and an SQL-like query language for querying RDF graphs via pattern matching.

VIRTUOSO

Both the back-end database engine and the HTTP/SPARQL server.


DBpedia.org

The Semantic Web mirror of Wikipedia.

RDF

A data model of graphs built on subject-predicate-object triples.

APACHE JENA

A free and open-source Java framework for building Semantic Web and Linked Data applications.

ARQ - a SPARQL processor for Jena, for querying remote SPARQL services.


DBpedia.org extracts structured content from Wikipedia editions in 119 languages, converts it into RDF, and makes this information available on the Web:

★ 24.9 million things (16.8 million from the English DBpedia);

★ labels and abstracts for 12.6 million unique things;

★ 24.6 million links to images and 27.6 million links to external web pages;

★ 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.


The dataset consists of 2.46 billion RDF triples: 470 million extracted from the English edition of Wikipedia, 1.98 billion extracted from the other language editions, and 45 million links to external datasets.

DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data.


What is a Triple?

A triple is the minimal amount of information expressible in the Semantic Web. It is composed of three elements:

1. A subject, which is a URI (e.g., a "web address") that represents something.

2. A predicate, which is another URI that represents a certain property of the subject.

3. An object, which can be a URI or a literal (e.g., a string) that is related to the subject through the predicate.


John has the email address john@email.com
(subject) (predicate) (object)

Subjects, predicates, and objects are represented with URIs, which can be abbreviated as prefixed names. Objects can also be literals: strings, integers, booleans, etc.
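The email example above can be written down as concrete RDF triples in Turtle syntax (a minimal sketch: the ex: prefix and resource names are illustrative, not real DBpedia URIs; foaf:mbox and foaf:name come from the FOAF vocabulary mentioned later):

```turtle
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# subject   predicate   object (a URI)
ex:John     foaf:mbox   <mailto:john@email.com> .
# the object can also be a literal, here a plain string
ex:John     foaf:name   "John" .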


Why SPARQL?

SPARQL is a query language of the Semantic Web that lets us:

1. Extract values from structured and semi-structured data

2. Explore data by querying unknown relationships

3. Perform complex joins of various datasets in a single query

4. Transform data from one vocabulary into another


Structure of a SPARQL query:

● Prefix declarations, for abbreviating URIs (PREFIX dbpowl: <http://dbpedia.org/ontology/> lets dbpowl:Mountain stand for <http://dbpedia.org/ontology/Mountain>)

● Dataset definition, stating what RDF graph(s) are being queried (DBPedia, Darwin Core Terms, Yago, FOAF - Friend of a Friend)

● A result clause, identifying what information to return from the query (SELECT)

● The query pattern, specifying what to query for in the underlying dataset (WHERE)

● Query modifiers, slicing, ordering, and otherwise rearranging query results - ORDER BY, GROUP BY
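Put together, the parts above map onto a query skeleton like this one (a minimal sketch; the dbpowl: prefix and the Mountain class follow the DBpedia ontology naming used in these slides):

```sparql
PREFIX dbpowl: <http://dbpedia.org/ontology/>   # prefix declarations
SELECT ?mountain                                # result clause
FROM <http://dbpedia.org>                       # dataset definition (optional)
WHERE { ?mountain a dbpowl:Mountain }           # query pattern
ORDER BY ?mountain                              # query modifiers
LIMIT 10
```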


## EXAMPLE - Give me all cities & towns in Abruzzo with more than 50,000 inhabitants

PREFIX dbpclass: <http://dbpedia.org/class/yago/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?resource ?value
WHERE {
  ?resource a dbpclass:CitiesAndTownsInAbruzzo .
  ?resource dbpprop:populationTotal ?value .
  FILTER ( ?value > 50000 )
}
ORDER BY ?resource ?value


Some PREFIX:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

PREFIX dc: <http://purl.org/dc/elements/1.1/>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dcterms: <http://purl.org/dc/terms/>

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

PREFIX txn: <http://lod.taxonconcept.org/ontology/txn.owl#>


DBPEDIA

----------------------------------------------------------------------------------

PREFIX dbp: <http://dbpedia.org/>

PREFIX dbpowl: <http://dbpedia.org/ontology/>

PREFIX dbpres: <http://dbpedia.org/resource/>

PREFIX dbpprop: <http://dbpedia.org/property/>

PREFIX dbpclass: <http://dbpedia.org/class/yago/>


Wikipedia articles consist mostly of free text, but also contain different types of structured information: infobox templates, categorisation information, images, geo-coordinates, and links to external Web pages. DBpedia transforms the data entered in Wikipedia into RDF triples, so creating a page in Wikipedia creates RDF in DBpedia.


Example:

https://en.wikipedia.org/wiki/Pulp_Fiction describes the movie. DBpedia creates a URI: http://dbpedia.org/resource/wikipedia_page_name (where wikipedia_page_name is the name of the regular Wikipedia HTML page) = http://dbpedia.org/page/Pulp_Fiction. Underscore characters replace spaces.

DBpedia can be queried via a Web interface at http://dbpedia.org/sparql. The interface uses the Virtuoso SPARQL Query Editor to query the DBpedia endpoint.


Public SPARQL endpoint - uses OpenLink Virtuoso

Wikipedia page: http://en.wikipedia.org/wiki/Pulp_Fiction

DBpedia resource: http://dbpedia.org/page/Pulp_Fiction

Infobox: dbpedia-owl:abstract, dbpedia-owl:starring, dbpedia-owl:budget, dbpprop:country, dbpprop:caption, etc.

(Figure: source code and visualisation of the infobox template containing structured information about Pulp Fiction.)


PREFIX prop: <http://dbpedia.org/property/>
PREFIX res: <http://dbpedia.org/resource/>
PREFIX owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?name ?abstract ?caption ?image ?budget ?director ?cast ?country ?category
WHERE {
  res:Pulp_Fiction prop:name ?name ;
    owl:abstract ?abstract ;
    prop:caption ?caption ;
    owl:thumbnail ?image ;
    owl:budget ?budget ;
    owl:director ?director ;
    owl:starring ?cast ;
    prop:country ?country ;
    dcterms:subject ?category .
  FILTER langMatches( lang(?abstract), 'en' )
}


...

Linked Data is a method of publishing RDF data on the Web and of interlinking data between different data sources.

Query builder:

➢ http://dbpedia.org/snorql/

➢ http://querybuilder.dbpedia.org/

➢ http://dbpedia.org/isparql/

➢ http://dbpedia.org/fct/

➢ http://it.dbpedia.org/sparql

Query variables start with "?"


The current RDF vocabularies are available at the following locations:

➔ W3C: http://www.w3.org/TR/vcard-rdf/ vCard Ontology - for describing people and organizations

➔ W3C: http://www.w3.org/2003/01/geo/ Geo Ontology - for spatially-located things

➔ W3C: http://www.w3.org/2004/02/skos/ SKOS Simple Knowledge Organization System


➔ GEO NAMES: http://www.geonames.org/ geospatial semantic information (postal code)

➔ DUBLIN CORE: http://www.dublincore.org/ defines general metadata attributes used in a particular application

➔ FOAF: http://www.foaf-project.org/ Friend of a Friend, vocabulary for describing people

➔ UNIPROT: http://www.uniprot.org/core/, http://beta.sparql.uniprot.org/uniprot - protein sequence and annotation data for the life sciences


➔ MUSIC ONTOLOGY: http://musicontology.com/, provides terms for describing artists, albums and tracks.

➔ REVIEW VOCABULARY: http://purl.org/stuff/rev , vocabulary for representing reviews.

➔ CREATIVE COMMONS (CC): http://creativecommons.org/ns , vocabulary for describing license terms.

➔ OPEN UNIVERSITY: http://data.open.ac.uk/


➔ Semantically-Interlinked Online Communities (SIOC): www.sioc-project.org/, vocabulary for representing online communities

➔ Description of a Project (DOAP): http://usefulinc.com/doap/, vocabulary for describing projects

➔ Simple Knowledge Organization System (SKOS): http://www.w3.org/2004/02/skos/, vocabulary for representing taxonomies and loosely structured knowledge


SPARQL queries have two parts (the FROM clause is optional):

1. The query (WHERE) part, which produces a list of variable bindings (although some variables may be unbound).

2. The part that puts together the results: SELECT, ASK, CONSTRUCT, or DESCRIBE.

Other keywords: UNION, OPTIONAL (optional display if data exists), FILTER (conditions), ORDER BY, GROUP BY
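A sketch of OPTIONAL and FILTER in use (the property choices are illustrative; dbpowl: and rdfs: follow the prefixes listed earlier):

```sparql
PREFIX dbpowl: <http://dbpedia.org/ontology/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?film ?label ?budget
WHERE {
  ?film a dbpowl:Film ;
        rdfs:label ?label .
  OPTIONAL { ?film dbpowl:budget ?budget }   # the row is kept even if no budget exists
  FILTER langMatches( lang(?label), 'en' )   # condition on the bindings
}
ORDER BY ?label
LIMIT 20
```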


SELECT - is effectively what the query returns (a ResultSet)

ASK - just looks to see if there are any results

CONSTRUCT - uses a template to make RDF from the results. For each result row it binds the variables and adds the statements to the result model. If a template triple contains an unbound variable it is skipped. Returns a new RDF graph.

DESCRIBE - unusual, since it takes each result node, finds triples associated with it, and adds them to a result model. Returns a new RDF graph.
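Sketches of the ASK and CONSTRUCT forms against DBpedia (two separate queries, shown together; the foaf:maker mapping in the CONSTRUCT template is an illustrative choice, not a DBpedia convention):

```sparql
PREFIX dbpres: <http://dbpedia.org/resource/>
PREFIX dbpowl: <http://dbpedia.org/ontology/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

# ASK returns only true or false: is a director recorded for Pulp Fiction?
ASK { dbpres:Pulp_Fiction dbpowl:director ?director }

# CONSTRUCT returns a new RDF graph built from the template triple
CONSTRUCT { dbpres:Pulp_Fiction foaf:maker ?director }
WHERE     { dbpres:Pulp_Fiction dbpowl:director ?director }
```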


What is linked data good for? Don't search for a single thing; explore a whole set of related things together!

1) Revolutionize Wikipedia Search

2) Include DBpedia data in our own web page

3) Mobile and Geographic Applications

4) Document Classification, Annotation and Social Bookmarking

5) Multi-Domain Ontology

6) Nucleus for the Web of Data

7) Support Wikipedia Authors with Editing Suggestions


MOBILE

QRpedia.org - MIT Licence


WIKIPEDIA DUMPS

● Arabic Wikipedia dumps: http://dumps.wikimedia.org/arwiki/

● Dutch Wikipedia dumps: http://dumps.wikimedia.org/nlwiki/

● English Wikipedia dumps: http://dumps.wikimedia.org/enwiki/

● French Wikipedia dumps: http://dumps.wikimedia.org/frwiki/

● German Wikipedia dumps: http://dumps.wikimedia.org/dewiki/

● Italian Wikipedia dumps: http://dumps.wikimedia.org/itwiki/

● Persian Wikipedia dumps: http://dumps.wikimedia.org/fawiki/

● Polish Wikipedia dumps: http://dumps.wikimedia.org/plwiki/


WIKIPEDIA DUMPS

● Portuguese Wikipedia dumps: http://dumps.wikimedia.org/ptwiki/

● Russian Wikipedia dumps: http://dumps.wikimedia.org/ruwiki/

● Serbian Wikipedia dumps: http://dumps.wikimedia.org/srwiki/

● Spanish Wikipedia dumps: http://dumps.wikimedia.org/eswiki/

● Swedish Wikipedia dumps: http://dumps.wikimedia.org/svwiki/

● Ukrainian Wikipedia dumps: http://dumps.wikimedia.org/ukwiki/

● Vietnamese Wikipedia dumps: http://dumps.wikimedia.org/viwiki/


LINK

Codelab’s project code: http://github.com/GDG-L-Ab/SparqlOpendataWS

http://dbpedia.org/sparql & http://it.dbpedia.org/sparql

http://wiki.dbpedia.org/Datasets

http://en.wikipedia.org/ & http://it.wikipedia.org/

http://dbpedia.org/snorql, http://data.semanticweb.org/snorql/ SPARQL Explorer

http://downloads.dbpedia.org/3.9/ & http://wiki.dbpedia.org/Downloads39


Projects that use linked data:

JAVA: OpenLearn Linked Data - free access to Open University course materials

PHP: Semantic MediaWiki - lets you store and query data within the wiki's pages

PERL: WikSAR

PYTHON: Braindump - semantic search in Wikipedia

RUBY: SemperWiki - lets you store and query data within the wiki's pages


THANK YOU! :-)

I AM

CAMELIA BOBAN

G+ : https://plus.google.com/u/0/+cameliaboban

Twitter : http://twitter.com/GDGRomaLAb

LinkedIn: it.linkedin.com/pub/camelia-boban/22/191/313
Blog: http://blog.aissatechnologies.com/
Skype: camelia.boban
Email: camelia.boban@gmail.com
