solutions linux 2011 merged

Upload: stefane-fermigier

Post on 07-Apr-2018


TRANSCRIPT

  • 8/6/2019 Solutions Linux 2011 Merged

    1/68

  • 8/6/2019 Solutions Linux 2011 Merged

    2/68

    Agenda

    A pragmatic introduction to the Semantic Web

    Experience report and demos from Nuxeo

    Apache tools for Big Linked Data

    Wednesday, May 11, 2011

  • 8/6/2019 Solutions Linux 2011 Merged

    3/68

    1. Introduction to the Semantic Web

  • 8/6/2019 Solutions Linux 2011 Merged

    4/68

    Prelude


  • 8/6/2019 Solutions Linux 2011 Merged

    5/68

    Source: Mills Davis, Semantic Social Computing, Sept. 2007

  • 8/6/2019 Solutions Linux 2011 Merged

    6/68

    History


  • 8/6/2019 Solutions Linux 2011 Merged

    7/68

  • 8/6/2019 Solutions Linux 2011 Merged

    8/68

  • 8/6/2019 Solutions Linux 2011 Merged

    9/68

  • 8/6/2019 Solutions Linux 2011 Merged

    10/68

    Historical perspective

    From web 1.0: web of sites and pages, aka the World Wide Web

    To web 2.0: web of people and of participation, aka the Social Web (Blogs, RSS, tags, Facebook, Wikipedia, etc.)

    To web 3.0: web of data, of meaning and connected knowledge, aka the Semantic Web

  • 8/6/2019 Solutions Linux 2011 Merged

    11/68

    Semantics & Ontologies


  • 8/6/2019 Solutions Linux 2011 Merged

    12/68

  • 8/6/2019 Solutions Linux 2011 Merged

    13/68

  • 8/6/2019 Solutions Linux 2011 Merged

    14/68

  • 8/6/2019 Solutions Linux 2011 Merged

    15/68

  • 8/6/2019 Solutions Linux 2011 Merged

    16/68

    Some examples

    FOAF: relationships between people (social network)

    SIOC: relationships between websites, articles, blogs, comments

    Rich Snippets: syndicate RDFa content for SEO by Google, Yahoo

    GoodRelations: e-commerce (eBay...)

    rNews: metadata for news agencies (AFP, Reuters...)

  • 8/6/2019 Solutions Linux 2011 Merged

    17/68

    How is it related to the Web?

  • 8/6/2019 Solutions Linux 2011 Merged

    18/68

    The traditional Web

    A principle: hypertext

    A protocol: HTTP

    An identification scheme: URNs/URIs

    A language: HTML


  • 8/6/2019 Solutions Linux 2011 Merged

    19/68

    To a computer, then, the web is a flat, boring world devoid of meaning

    Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 8/6/2019 Solutions Linux 2011 Merged

    20/68

    This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them

    Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 8/6/2019 Solutions Linux 2011 Merged

    21/68

    Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.

    Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 8/6/2019 Solutions Linux 2011 Merged

    22/68

    The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

    Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 8/6/2019 Solutions Linux 2011 Merged

    23/68

    The traditional Web

    A principle: hypertext

    A protocol: HTTP

    An identification scheme: URNs/URIs

    A language: HTML


  • 8/6/2019 Solutions Linux 2011 Merged

    24/68

    The semantic Web

    A principle: hypertext

    A protocol: HTTP

    An identification scheme: URNs/URIs

    A language: RDF (instead of HTML)

  • 8/6/2019 Solutions Linux 2011 Merged

    25/68

    The W3C Layer Cake


  • 8/6/2019 Solutions Linux 2011 Merged

    26/68

    The W3C Layer Cake

    Already standardized

  • 8/6/2019 Solutions Linux 2011 Merged

    27/68

    URIs and the Web of Things

    URIs (Uniform Resource Identifiers) are used to identify things (also called entities) in the real world

    For instance: people, places, events, companies, products, movies, etc.

  • 8/6/2019 Solutions Linux 2011 Merged

    28/68

    The RDF model

    Subject, Predicate, Object

    RDF is used to describe relationships

    between objects, identified by their URIs
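
A minimal sketch of the subject/predicate/object model, using plain Python tuples rather than an RDF library. The `example.org` URIs and the `match` helper are assumptions made up for illustration; `foaf:knows` and `foaf:name` are real FOAF properties.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the relationship
    obj: str        # URI of the related thing, or a literal value


FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/people/"  # hypothetical namespace

graph = [
    Triple(EX + "alice", FOAF + "knows", EX + "bob"),
    Triple(EX + "alice", FOAF + "name", "Alice"),
    Triple(EX + "bob", FOAF + "name", "Bob"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Who does Alice know?
friends = [t.obj for t in match(graph, subject=EX + "alice",
                                predicate=FOAF + "knows")]
```

Because every statement has the same three-part shape, graphs from different sources can be merged by simply concatenating their triples.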


  • 8/6/2019 Solutions Linux 2011 Merged

    29/68

    Example

    Source: http://www.slideshare.net/AntidotNet/web-smantique-web-de-donnes-web-30-linked-data-quelques-repres-pour-sy-retrouver

  • 8/6/2019 Solutions Linux 2011 Merged

    30/68

    RDF serialization

    As XML:

    Others, e.g. N3:
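
The serialized examples from this slide did not survive in the transcript. As a stand-in, here is a hedged sketch that writes one statement both as RDF/XML and as N-Triples (a line-based format closely related to N3). The `example.org` URIs and both helper functions are illustrative assumptions, and the RDF/XML helper only handles predicates in the FOAF namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# One statement: subject URI, predicate URI, object URI (made-up people).
triple = (
    "http://example.org/people/alice",
    FOAF_NS + "knows",
    "http://example.org/people/bob",
)


def to_ntriples(s, p, o):
    """One statement per line, every URI wrapped in angle brackets."""
    return f"<{s}> <{p}> <{o}> ."


def to_rdfxml(s, p, o):
    """Minimal RDF/XML; assumes the predicate lives in the FOAF namespace."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": s})
    local_name = p[len(FOAF_NS):]
    ET.SubElement(desc, f"{{{FOAF_NS}}}{local_name}",
                  {f"{{{RDF_NS}}}resource": o})
    return ET.tostring(root, encoding="unicode")


ntriples_doc = to_ntriples(*triple)
rdfxml_doc = to_rdfxml(*triple)
```

Both outputs carry the same triple; only the surface syntax differs.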

  • 8/6/2019 Solutions Linux 2011 Merged

    31/68

    SPARQL

    Query language for RDF databases

    Several implementations

    OSS: Apache Jena, Sesame, 4Store, Virtuoso, Mulgara, Redland, Open Anzo...

    Proprietary: 5Store, AllegroGraph RDFStore, Stardog, Dydra, OWLIM...

    More expressive than SQL, scalability is still an open question

  • 8/6/2019 Solutions Linux 2011 Merged

    32/68

    SPARQL Sample
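
The original sample is not preserved in the transcript. The sketch below shows an illustrative SPARQL query (not the one from the slide) together with a tiny in-memory evaluator for its WHERE clause, to make the idea of matching triple patterns concrete. The data, the `example.org` URIs and the `solve` function are assumptions; a real store such as Jena or Sesame would parse and run the query itself.

```python
# Illustrative query: names of the people that someone knows.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  ?person foaf:knows ?friend .
  ?friend foaf:name ?name .
}
"""

FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/people/"  # hypothetical namespace

TRIPLES = [
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "alice", FOAF + "name", "Alice"),
    (EX + "bob", FOAF + "name", "Bob"),
]


def is_var(term):
    return term.startswith("?")


def solve(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(rest, triples, candidate)


# The WHERE clause above, translated by hand into triple patterns.
WHERE = [
    ("?person", FOAF + "knows", "?friend"),
    ("?friend", FOAF + "name", "?name"),
]

names = [b["?name"] for b in solve(WHERE, TRIPLES)]
```

The shared variable `?friend` is what joins the two patterns, much like a join column in SQL.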


  • 8/6/2019 Solutions Linux 2011 Merged

    33/68

    Where and how to find these data?

  • 8/6/2019 Solutions Linux 2011 Merged

    34/68

    Solution 1: Lift

    One can use HTML scraping and natural language processing (NLP) techniques to extract semantic information from existing content / sites

    Generic solutions: OpenCalais, Zemanta, Apache Stanbol

    Pro: no need to change existing content

    Con: error-prone, needs human checks

  • 8/6/2019 Solutions Linux 2011 Merged

    35/68

    Example: DBpedia

  • 8/6/2019 Solutions Linux 2011 Merged

    36/68

    Solution 2: export

    RDFa and microformats are used to embed semantic information (expressed using the RDF model) into regular web pages

    RDFa does it using existing (rel) and additional (about, property, typeof) attributes

    Microformats only use usual HTML attributes (class)
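
To illustrate how the RDFa attributes named above map to triples, here is a rough sketch built on Python's standard `html.parser`. It handles only a tiny subset of RDFa: it ignores prefix declarations, element scoping and most of the real processing rules. The HTML snippet, the `example.org` URIs and the `RDFaSketch` class are all made up for this example.

```python
from html.parser import HTMLParser

PAGE = """
<div about="http://example.org/people/alice" typeof="foaf:Person">
  <span property="foaf:name">Alice</span>
  <a rel="foaf:knows" href="http://example.org/people/bob">Bob</a>
</div>
"""


class RDFaSketch(HTMLParser):
    """Collects (subject, predicate, object) from a small subset of RDFa."""

    def __init__(self):
        super().__init__()
        self.subject = None   # value of the most recent 'about' attribute
        self.pending = None   # 'property' still waiting for its text value
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
            if "typeof" in a:
                self.triples.append((self.subject, "rdf:type", a["typeof"]))
        if "rel" in a and "href" in a:
            # 'rel' links the current subject to another resource.
            self.triples.append((self.subject, a["rel"], a["href"]))
        if "property" in a:
            self.pending = a["property"]

    def handle_data(self, data):
        # The element text becomes the literal value of a pending property.
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


parser = RDFaSketch()
parser.feed(PAGE)
triples = parser.triples
```

The page still renders as ordinary HTML; the extra attributes are only meaningful to RDFa-aware consumers.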

  • 8/6/2019 Solutions Linux 2011 Merged

    37/68

    Solution 3: reuse

    Linked Open Data: (usually large) data repositories available on the web (for free or not), expressed using the RDF model

    Interoperability between these repositories (their ontologies) must be defined

  • 8/6/2019 Solutions Linux 2011 Merged

    38/68

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

    Linked Open Data in 2007

  • 8/6/2019 Solutions Linux 2011 Merged

    39/68

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

    2008

  • 8/6/2019 Solutions Linux 2011 Merged

    40/68

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

    2009

  • 8/6/2019 Solutions Linux 2011 Merged

    41/68

    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

    2010

  • 8/6/2019 Solutions Linux 2011 Merged

    42/68

    Good for Enterprise apps too!

    Diagram source: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/


  • 8/6/2019 Solutions Linux 2011 Merged

    43/68

    Why now?


  • 8/6/2019 Solutions Linux 2011 Merged

    44/68

    Key Enablers

    Open Data and Linked Online Data

    Advances in automatic content analysis (linguistics, image processing) and machine learning

    Classical logic and classical AI

    Computing power (Moore's law + MapReduce)

  • 8/6/2019 Solutions Linux 2011 Merged

    45/68

    The technologies and data are available. Let's put them to use!

  • 8/6/2019 Solutions Linux 2011 Merged

    46/68

    2. Nuxeo & Semantic ECM

  • 8/6/2019 Solutions Linux 2011 Merged

    47/68

    Nuxeo: an open source ECM vendor

    Our Focus is Enterprise Content Management

    ECM as a Platform for Content Applications

    Open Source as Efficient Development Model

    Modern architecture for 21st Century business

    Lean, mobile, social, interoperable

    A Social Marketplace in action

    Innovation driven by community of customers, partners, and our core developers

  • 8/6/2019 Solutions Linux 2011 Merged

    48/68


  • 8/6/2019 Solutions Linux 2011 Merged

    49/68

    Major Customers

  • 8/6/2019 Solutions Linux 2011 Merged

    50/68

    Goals for Semantic ECM

    Repurpose existing content better

    Improve search and collaboration

    Make information more contextual

    Extract and use information from content

    Leverage Open and Linked Data, contribute

    Make ECM users' content smarter!

    => Gain efficiency, effectiveness and strategic positioning on the ECM market

  • 8/6/2019 Solutions Linux 2011 Merged

    51/68

    Demo

  • 8/6/2019 Solutions Linux 2011 Merged

    52/68

    IKS project

    European project under the FP7, with 13 partners (6 SMEs) and an 8.5 MEUR budget

    Goal: create a semantic software stack that will be used by CMS vendors to add semantic features to their products

    Started in Jan. 2009, will last until Dec. 2012

    First tangible result: Apache Stanbol, already integrated in a Nuxeo plugin

  • 8/6/2019 Solutions Linux 2011 Merged

    53/68

    The Semantic Engine

    From unstructured content to Knowledge

    Language guessing

    Topic classification (Business, Sports, Media, ...)

    Named Entities extraction and linking

    Relationships and properties extraction

  • 8/6/2019 Solutions Linux 2011 Merged

    54/68


  • 8/6/2019 Solutions Linux 2011 Merged

    55/68


  • 8/6/2019 Solutions Linux 2011 Merged

    56/68


  • 8/6/2019 Solutions Linux 2011 Merged

    57/68


  • 8/6/2019 Solutions Linux 2011 Merged

    58/68

    = Semantic Engines (Apache OpenNLP)

    + Fast Linked Data local index (Apache Solr)

    + Semantic Rule Engine (Apache Jena)

  • 8/6/2019 Solutions Linux 2011 Merged

    59/68

    [Diagram: numbered steps 1, 2, 3; data sources: DBpedia, Freebase, Geonames, LDAP]

  • 8/6/2019 Solutions Linux 2011 Merged

    60/68

    3. Apache tools for processing Big and/or Linked Data

  • 8/6/2019 Solutions Linux 2011 Merged

    61/68

    Training statistical models for NER with Wikipedia and DBpedia

    Extract sentences with link positions in Wikipedia articles

    DBpedia to find the type of the target entity (Person, Location, Organization)

    Apache Pig scripts to compute the join + format the result as training files for OpenNLP

    Apache OpenNLP to build and evaluate the models

    Apache Hadoop for distributed processing

    Apache Whirr for deployment and management on Amazon EC2 cluster
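
The formatting step above can be sketched in a few lines. This is plain Python standing in for the Pig scripts mentioned on the slide, not the actual pipeline: it takes one tokenized sentence, the token spans of its Wikipedia links, and a lookup of each link target's DBpedia class, then renders the line in the `<START:type> ... <END>` format expected by the OpenNLP name finder trainer. The `TYPE_MAP` mapping, the function name and the sample data are assumptions.

```python
# Assumed mapping from DBpedia ontology classes to entity type labels.
TYPE_MAP = {
    "http://dbpedia.org/ontology/Person": "person",
    "http://dbpedia.org/ontology/Place": "location",
    "http://dbpedia.org/ontology/Organisation": "organization",
}


def to_opennlp_line(tokens, links, entity_types):
    """Render one sentence in OpenNLP name finder training format.

    tokens: list of words
    links: list of (start, end, target) token spans, end exclusive
    entity_types: link target -> DBpedia class URI
    """
    starts = {}
    ends = set()
    for start, end, target in links:
        label = TYPE_MAP.get(entity_types.get(target))
        if label:  # skip links whose target type is unknown or not wanted
            starts[start] = label
            ends.add(end)
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:  # entity that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


sentence = ["Tim", "Berners-Lee", "works", "at", "the", "W3C", "."]
links = [(0, 2, "Tim_Berners-Lee"), (5, 6, "World_Wide_Web_Consortium")]
types = {
    "Tim_Berners-Lee": "http://dbpedia.org/ontology/Person",
    "World_Wide_Web_Consortium": "http://dbpedia.org/ontology/Organisation",
}
line = to_opennlp_line(sentence, links, types)
```

Here `line` is a single training sentence; a training file is simply many such lines, one sentence per line.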

  • 8/6/2019 Solutions Linux 2011 Merged

    62/68


  • 8/6/2019 Solutions Linux 2011 Merged

    63/68


  • 8/6/2019 Solutions Linux 2011 Merged

    64/68


  • 8/6/2019 Solutions Linux 2011 Merged

    65/68


  • 8/6/2019 Solutions Linux 2011 Merged

    66/68

    Training statistical models for topic classification from Wikipedia and DBpedia

    Filter category tree from DBpedia SKOS entries (~500k)

    Pig scripts to compute the joins with article abstracts for all the articles categorized in Wikipedia

    Export as a 2.8 GB TSV file to be indexed in Apache Solr

    Use Solr MoreLikeThisHandler to find the top 5 most related Wikipedia categories for any kind of text

    Apache Whirr & Hadoop for deployment and management on Amazon EC2 cluster
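
As a rough idea of what the MoreLikeThis lookup could look like from a client, the sketch below only builds the request URL and sends nothing. It assumes a MoreLikeThisHandler registered at `/mlt` in `solrconfig.xml`, a hypothetical core named `categories` with a `text` field holding the abstracts, and a Solr setup that accepts `stream.body` content streams. The `mlt.*` parameter names are standard Solr parameters; everything else is illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def more_like_this_url(base_url, text, field="text", rows=5):
    """Build a MoreLikeThis request URL for an arbitrary piece of text."""
    params = {
        "stream.body": text,  # the text to classify, sent as a content stream
        "mlt.fl": field,      # field used to select the interesting terms
        "mlt.mintf": 1,       # minimum term frequency in the input text
        "mlt.mindf": 1,       # minimum document frequency in the index
        "rows": rows,         # top N most similar category documents
        "fl": "id,score",
        "wt": "json",
    }
    return f"{base_url}/mlt?{urlencode(params)}"


url = more_like_this_url(
    "http://localhost:8983/solr/categories",  # hypothetical core
    "The striker scored twice in the cup final",
)

# Decode the query string again to inspect what would be sent.
query = parse_qs(urlsplit(url).query)
```

The ids of the returned documents would then be read as the suggested Wikipedia categories for the submitted text.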

  • 8/6/2019 Solutions Linux 2011 Merged

    67/68

    What's next?

    Integrate the R&D results into Stanbol / Nuxeo

    Work on user interface / high-level JavaScript toolkits for Linked Data editing

    http://github.com/bergie/VIE based on backbone.js

    Experiment / Integrate / Refine
  • 8/6/2019 Solutions Linux 2011 Merged

    68/68

    Resources

    http://iks-project.eu

    http://stanbol.demo.nuxeo.com

    http://incubator.apache.org/stanbol

    http://blogs.nuxeo.com/dev

    http://hadoop.apache.org/

    http://incubator.apache.org/opennlp/

    http://github.com/ogrisel/pignlproc

    http://incubator.apache.org/projects/whirr.html