Linked data based semantic annotation using Drupal and Apache Stanbol
DESCRIPTION
My presentation from Drupalaton 2013 - http://drupalaton.hu/schedule#speaker-30
This session will focus on the implementation of semantic services (automatic content enhancement, autotagging, content recommendation, reasoning) based on linked data datasets using the integration of Drupal with Apache Stanbol.
During the presentation the audience will find out about:
the main features of Apache Stanbol and its integration with Drupal
how to discover and use custom/domain-specific Linked Data datasets with Apache Stanbol/Drupal
how to build an advanced semantic processing chain in Apache Stanbol that will automatically annotate Drupal entities
how to implement a content recommendation/reasoning feature for Drupal based on Apache Stanbol services
Apache Stanbol is an open source software stack designed to provide a powerful semantic engine via RESTful services returning results as RDF (Resource Description Framework) and JSON. Unlike existing proprietary, commercially oriented solutions such as OpenCalais, Apache Stanbol is highly customizable and may be trained to provide semantic services for virtually any language.
TRANSCRIPT
Drupal and Apache Stanbol
LINKED DATA BASED SEMANTIC ANNOTATION
Gabriel Dragomir
Sunday, August 18, 13
The Semantic Web
Tim Berners-Lee:
“The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web – a Web of data that can be processed directly or indirectly by machines.”
What’s the hype?
Most organizations need to organize, analyze and relate huge amounts of unstructured, scattered textual data
Examples:
keyword extraction from content: annotate abstracts
text categorization: organize big volumes of text based on a thesaurus
media monitoring of tags: occurrences of a specific keyword on social media channels
Linked data
Project started in 2007
Aimed at building the Web of Data by:
identifying open access data sets
converting them into RDF vocabularies
publishing them as open access data sets
Linked data ecosystem
Linked Open Vocabularies (LOV): http://lov.okfn.org/dataset/lov/
Provides a conceptual map of the vocabularies
Various providers: libraries, governmental actors, NGOs
Linked data ecosystem
Where to find other data sets?
http://www.w3.org/2001/sw/wiki/SKOS/Datasets
Swoogle: http://swoogle.umbc.edu/
PoolParty: http://vocabulary.semantic-web.at
Linked data at work!
Semantic annotation
Creates specific metadata that enable new ways to retrieve and aggregate information
Annotations are based on a conceptual scheme, an ontology (e.g. FOAF, Dublin Core)
For more on ontologies see: http://www.w3.org/wiki/Good_Ontologies
The annotations build semantic relationships: e.g. rdf:type, owl:sameAs
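As a minimal sketch of what such annotations look like as data, the snippet below writes two statements about a hypothetical Drupal node as N-Triples. The `example.org` node URI and the helper name `to_ntriples` are assumptions for illustration only; the `rdf:type`, `owl:sameAs` and FOAF URIs are the standard namespace URIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical local entity URI (assumption); the DBpedia URI is the
# external resource the local entity is declared equivalent to.
person = "http://example.org/node/42"
annotations = [
    (person, RDF_TYPE, FOAF_PERSON),
    (person, OWL_SAME_AS, "http://dbpedia.org/resource/Tim_Berners-Lee"),
]

print(to_ntriples(annotations))
```

The first triple classifies the node using an ontology term, the second links it to an existing Linked Data resource.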
Semantic annotation
Most common uses:
Named Entity Linking: limited to recognizing entities of type person, organization, place (e.g. OpenCalais)
Entityhub Linking: annotation based on vocabularies with no limitations of entity types. Requires more natural language processing prior to annotation.
Apache Stanbol on the fly
Here comes Apache Stanbol
A new approach:
modular semantic analysis of documents
processing components can be built for virtually any language
flexible workflows via semantic annotation chains
any vocabulary (Linked Data, custom) can be used
From IKS to Apache Stanbol
IKS - Interactive Knowledge Stack for small to medium CMS providers - EU funded consortium
An open source software stack written in Java
Goal: extract and process semantic data from documents
Project undergoing incubation at Apache Foundation
http://stanbol.apache.org
Service oriented architecture
Stanbol is designed to offer service oriented integration
RESTful web services API returning RDF or JSON/JSON-LD
Each component exposes an endpoint independently
Open Services Gateway initiative compliant (OSGi) via Apache Felix and Apache Sling
Remote component management
Implementation
OSGi layer: Apache Felix and Apache Sling
Build environment: Apache Maven
RDF framework: Apache Clerezza
Triple store, reasoning engine: Apache Jena
Indexing and semantic search: Apache Solr
Content analysis/metadata extraction: Apache Tika
Natural language processing: Apache OpenNLP
Architecture
Components
Semantic layer:
Enhancer, EntityHub, ContentHub
Enhancement engines: internal, 3rd party
User interfaces
Knowledge integration (rule sets, reasoners)
Storage integration
Content enhancement
Examples:
retrieve additional metadata for a piece of content
identify the language of a text
extract entities (persons, places, organizations)
create annotations to external sources
use 3rd party services for named entity recognition
Drupal meets Stanbol
Several modules implement RDF support, allowing data to be sent to Stanbol for semantic annotation
Taxonomy system allows for complex annotation
Fieldable taxonomy terms allow for storage of complex semantic data
User scenarios
Semantic indexing via Stanbol (SOLR yard)
Content enrichment with semantically related information (documents, factual data, images etc.)
Tag as you type: dynamic annotation of text in editors
How it works
POST request sends content via REST API
content is processed by an enhancement chain
Returns JSON-LD, RDF/XML, RDF/JSON etc.
JSON-LD - JavaScript Object Notation for Linked Data: a simple, human-readable linked data transport format
for best results an enhancement chain should do language detection, tokenization and POS tagging prior to performing semantic annotation
http://drupalaton.jelastic.dogado.eu/stanbol/enhancer
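The request described above can be sketched in Python with only the standard library. This is an illustration under assumptions, not the code used in the demo: the base URL points at a hypothetical local Stanbol instance, the helper name `build_enhancer_request` is invented here, and the `/enhancer/chain/<name>` path for a named chain should be checked against the Stanbol version in use.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_enhancer_request(base_url, text, chain=None, accept="application/rdf+xml"):
    """Build (but do not send) a POST request carrying plain text to the Enhancer."""
    url = base_url.rstrip("/") + "/enhancer"
    if chain:
        # Assumed path layout for addressing a named enhancement chain.
        url += "/chain/" + quote(chain, safe="")
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # The Accept header selects the serialization of the results.
            "Accept": accept,
        },
        method="POST",
    )


req = build_enhancer_request(
    "http://localhost:8080",  # assumption: a locally running Stanbol instance
    "Apache Stanbol was presented at Drupalaton in Hungary.",
)
print(req.get_method(), req.full_url)

# Sending the request needs a running Stanbol instance:
# with urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before pointing the code at a real server.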
Drupal integration
Source: blog.iks-project.eu
Drupal distribution: IKS CE
IKS CE distribution - Wolfgang Ziegler (fago), Stéphane Corlosquet (scor)
Components:
Search API Stanbol
VIE.js - semantic annotation UI
https://drupal.org/project/iksce
http://drupal.org/project/vie
http://drupal.org/project/search_api_stanbol
Search API Stanbol
enables the indexing of Drupal entities such as nodes, users, taxonomy terms, files, etc. in Stanbol EntityHub.
data sent as RDF
data can be mashed up with data from other sources (Managed Sites, Remote Sites)
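Once entities are indexed, they can be looked up again by URI. The sketch below only builds the lookup URL; the `/entityhub/entity` and `/entityhub/site/<site>/entity` paths with an `id` query parameter are assumptions based on the Entityhub REST layout and should be verified against the installed Stanbol version. The base URL and the helper name `entityhub_entity_url` are hypothetical.

```python
from urllib.parse import quote, urlencode


def entityhub_entity_url(base_url, entity_uri, site=None):
    """Build a lookup URL for an entity, optionally scoped to one site."""
    base = base_url.rstrip("/") + "/entityhub"
    if site:
        # Assumed path for querying a single site such as "dbpedia".
        base += "/site/" + quote(site, safe="")
    # urlencode percent-encodes the entity URI so it is safe as a query value.
    return base + "/entity?" + urlencode({"id": entity_uri})


url = entityhub_entity_url(
    "http://localhost:8080",  # assumption: a locally running Stanbol instance
    "http://dbpedia.org/resource/Paris",
    site="dbpedia",
)
print(url)
```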
VIE.js
“Vienna IKS Editables”
JavaScript library for implementing decoupled Content Management Systems and semantic interaction in web applications.
Monolithic vs Decoupled Content Management
Monolithic vs Decoupled Content Management Systems
source: Henri Bergius - http://bergie.iki.fi
Demo setup
we store Drupal entities in a SOLR index
annotations are to be made based on:
DBPedia - bundled with Apache Stanbol
a custom vocabulary of terms related to semantic web - Social Semantic Web Thesaurus
SemWeb is imported as a SOLR index into Apache Stanbol
Custom vocabularies
Social Semantic Web Thesaurus
1959 concepts related to semantic web
Author: Andreas Blumauer
http://vocabulary.semantic-web.at/semweb.html
http://vocabulary.semantic-web.at/semweb/8.visual
Demo
index Drupal entities in Apache Stanbol
retrieve annotated entities via REST API
annotate entities using dbpedia and semweb indexes
edit Drupal entities and annotate on the fly
retrieve linked data tag recommendations
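The last step, turning enhancement results into tag recommendations, might look roughly like the sketch below. The sample document is hand-written and simplified: its keys are modelled loosely on Stanbol's enhancement vocabulary, and real responses differ in detail. The helper name `recommend_tags` and the confidence threshold are assumptions for illustration.

```python
import json

# Simplified, hand-written sample (not real Stanbol output).
SAMPLE = json.loads("""
{
  "@graph": [
    {"@type": ["Enhancement", "TextAnnotation"],
     "selected-text": "Linked Data"},
    {"@type": ["Enhancement", "EntityAnnotation"],
     "entity-label": "Linked data",
     "entity-reference": "http://dbpedia.org/resource/Linked_data",
     "confidence": 0.92},
    {"@type": ["Enhancement", "EntityAnnotation"],
     "entity-label": "Data",
     "entity-reference": "http://dbpedia.org/resource/Data",
     "confidence": 0.31}
  ]
}
""")


def recommend_tags(doc, min_confidence=0.5):
    """Return (label, URI) pairs for entity annotations above a confidence threshold."""
    tags = []
    for node in doc.get("@graph", []):
        if "EntityAnnotation" not in node.get("@type", []):
            continue  # skip text annotations and other node types
        if node.get("confidence", 0.0) >= min_confidence:
            tags.append((node["entity-label"], node["entity-reference"]))
    return tags


for label, uri in recommend_tags(SAMPLE):
    print(label, uri)
```

Each recommended tag keeps its Linked Data URI, so it can be stored in a fieldable taxonomy term rather than as a bare string.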
Questions?
Contact me
twitter: gabidrg
Thank you!
http://mures2013.drupalcamp.ro