Exploiting Semantic Web Techniques for Representing and Utilising Folksonomies
Owen Sacco
Presentation Map
Introduction
• Aim & Goals
The Semantic Web
• Meta Formats, Vocabularies & Query Language
Web 2.0
• Web 2.0 Technologies & Applications
Folksonomies
• Tags, Tagging, Representing Tags Semantically & Integrating Folksonomies with the Semantic Web
Graph Mining Techniques
• Fast Unfolding of Communities in Large Networks
State of the Art Tool
• Examining the Edge List
• The Community Structure Ontology
• Jena & Corese
• Creating & Querying RDF Statements
• Analysis & Results
Conclusion
• Enhancements & Future Work
Introduction
The research is about:
• Understanding various Semantic Web technologies for representing data semantically
• Understanding folksonomies and how to represent them semantically
• Semantically representing tags retrieved from Bibsonomy (http://www.bibsonomy.org/)
- The tags have been hierarchically structured using the algorithm “fast unfolding of communities in large networks”
• Using Semantic Web technologies to create and exploit such a representation of tags
The Semantic Web
What is the Semantic Web?
• Not a separate Web
• An extension of the current Web
• Semantic = Meaning
• Semantic Web = Meaningful Data
• Meaning is expressed as data about data, i.e. metadata
Advantages of the Semantic Web:
• Information is given well-defined meaning
• Better enabling computers and people to work in cooperation
Source: W3C
Resource Description Framework (RDF)
• A framework that describes resources on the WWW
• Suitable for merging data on the Web
• Resources are uniquely identified by URIs
• The RDF Model is made up of triple statements
• Triple statements: Subject, Predicate & Object
[Diagram: Subject --Predicate--> Object]
• An RDF Model can be serialised in RDF/XML
• An example of an RDF document:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>
Source: W3C RDF Primer
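To make the link between the RDF/XML serialisation and the triple model concrete, the following is a minimal Python sketch that reads a trimmed copy of the example and lists its subject, predicate and object triples. It is not a general RDF/XML parser: it assumes the flat shape of this one example, the function names are illustrative, and the mailbox property is left out because its address is not legible in the source.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF Primer example from the slide, minus the mailbox property.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>"""


def expand(tag):
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    return tag.replace("{", "").replace("}", "")


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple typed nodes.

    Only the flat shape used above is handled: typed node elements with
    rdf:about, whose children are literal or rdf:resource properties.
    """
    triples = []
    for node in ET.fromstring(rdf_xml):
        subject = node.get(f"{{{RDF_NS}}}about")
        # A typed node element such as contact:Person implies an rdf:type triple.
        triples.append((subject, RDF_NS + "type", expand(node.tag)))
        for prop in node:
            target = prop.get(f"{{{RDF_NS}}}resource")
            value = target if target is not None else (prop.text or "").strip()
            triples.append((subject, expand(prop.tag), value))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```

Each printed tuple corresponds to one arc of the RDF graph, with the person's URI as the subject of all three statements.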
Ontology
• “A formal explicit specification of a shared conceptualisation”
• In other words: parties that share a common concept of data agree on such concepts and specify them as clearly as possible
• It is an enabling technology for information sharing and manipulation
• A vocabulary for RDF documents
• Ontologies are based on RDF models and are expressed using the Web Ontology Language (OWL)
SPARQL – An RDF Query Language
• Query in the Semantic Web context means: “Technologies and protocols that programmatically retrieve information from the Web of Data”.
• Based on triple patterns similar to RDF triples
• A query returns resources for all RDF triples that match the query’s pattern
• Is used to return complex data for mash-ups or search engines containing semantic data
• Syntax is similar to SQL
Source: W3C
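The idea of a triple pattern can be illustrated without a SPARQL engine. The Python sketch below matches one pattern against a small in-memory list of triples and returns the variable bindings, which is roughly what a single-pattern SELECT query does. The `ex:` and `contact:` prefixed names and the sample data are placeholders chosen for this illustration, and real engines also handle joins, filters and much more.

```python
# Roughly equivalent SPARQL, for comparison:
#   PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
#   SELECT ?person ?name WHERE { ?person contact:fullName ?name }

TRIPLES = [
    ("ex:me", "contact:fullName", "Eric Miller"),
    ("ex:me", "contact:personalTitle", "Dr."),
    ("ex:other", "contact:personalTitle", "Prof."),
]


def match(pattern, triples):
    """Return one dict of variable bindings per triple matching the pattern.

    Terms starting with '?' are variables; every other term must be equal
    to the value at the same position in the triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


print(match(("?person", "contact:fullName", "?name"), TRIPLES))
```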
Web 2.0
A “Read/Write” Web
Web 2.0 has:
• Facilitated web design
• Provided attractive, rich, easy-to-use interfaces
• Assisted in the reuse of data by merging information from various sources
• Created social networks of people
According to Internet World Stats, the number of Internet users doubled between 2000 and 2003, thanks to Friendster (one of the first social network websites)
Source: Internet World Stats - Internet Growth Statistics
Web 2.0 is considered a Social Web
• People are more involved by collaborating & sharing data
One of the major Web 2.0 technologies for web development is AJAX
• A combination of several technologies:
- HTML or XHTML
- Cascading Style Sheets (CSS)
- JavaScript
- XML
Web 2.0 created new application concepts:
• Blogs (Blogger, WordPress)
• Wikis (Wikipedia)
• Really Simple Syndication (RSS)
• Mashups (MusicMesh, BBC Music)
• Social Networks (Facebook, LinkedIn, MySpace)
• Social Bookmarking (delicious, Bibsonomy)
• Photo Sharing (Flickr)
• Video Sharing (YouTube, Vimeo)
In most of these concepts you find Tagging!
Folksonomies
Tag
• “A non-hierarchical keyword or term”
Tagging
• “Assign a tag to a piece of information or resources”
Tagger
• “The person that assigns the tag”
Folksonomy
• “The result of personal free tagging of information and objects for one’s own retrieval. The tagging is done in a social environment.” Thomas Vander Wal (2004)
Tag Cloud
• A visualisation of popular tags
• Popular tags stand out from the others by being set in a larger font or otherwise emphasised
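As a rough illustration of how a tag cloud can emphasise popular tags, the Python sketch below scales font sizes linearly with tag frequency. The linear scaling and the 10 to 32 point range are assumptions made for this example; actual sites may use logarithmic or bucketed scales.

```python
from collections import Counter


def tag_cloud(tags, min_pt=10, max_pt=32):
    """Map each tag to a font size that grows linearly with its frequency."""
    counts = Counter(tags)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = (highest - lowest) or 1  # avoid division by zero when all counts are equal
    return {
        tag: round(min_pt + (count - lowest) * (max_pt - min_pt) / span)
        for tag, count in counts.items()
    }


print(tag_cloud(["web", "web", "web", "rdf", "rdf", "java"]))
```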
Where can we tag?
• Social Bookmarking websites
• Picture sharing websites
• Video sharing websites
Why tagging?
• It’s Popular
- Nowadays, practically anyone who uses a computer or the Internet is exposed to tagging in some way.
• It’s Social
- Through the most popular tags, we can see a kind of rough consensus on the subject of the resource.
• It’s Flexible
- Tagging is ad hoc and free-form, and does not adhere to any strict classification scheme or vocabulary.
Basic Model
• Taggers create the tags, and sometimes they add resources.
• If we can identify something, then it can be tagged.
• Tagging is open-ended; tags can be any kind of term.
Source: Smith G. 2008. Tagging People-Powered Metadata for the Social Web
How about:
• Collaboratively sharing tags across multiple applications
• Collaborative filtering based on tagging
• Connecting people based on tagging
All these can be achieved through Tag Ontologies
• An ontology is not a taxonomy
• An ontology establishes semantic agreement
• Semantic agreement enables useful composition
Richard Newman’s Tag Ontology
Source: Haklae Kim et al., Review and Alignment of Tag Ontologies for Semantically-Linked Data in Collaborative Tagging Spaces
Tom Gruber’s Conceptual Model
Tagging(object, tag, tagger, source, + or -)
Source: Gruber T., Ontology for Folksonomy: A Mash-Up of Apples and Oranges.
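The five-place Tagging relation above can be sketched as a small record type. In the Python sketch below the field names follow the slide, while the `polarity` encoding (+1 or -1), the `net_votes` helper and the sample data are assumptions added for illustration, showing how votes for and against a tag assertion might be aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagging:
    obj: str           # the tagged object
    tag: str           # the tag label
    tagger: str        # who asserted the tag
    source: str        # the application the tagging came from
    polarity: int = 1  # +1 asserts the tag, -1 votes against it (assumed encoding)


def net_votes(taggings):
    """Sum polarities per (object, tag) pair across taggers and sources."""
    scores = defaultdict(int)
    for tagging in taggings:
        scores[(tagging.obj, tagging.tag)] += tagging.polarity
    return dict(scores)


# Hypothetical sample data.
sample = [
    Tagging("http://example.org/page", "semanticweb", "alice", "bibsonomy"),
    Tagging("http://example.org/page", "semanticweb", "bob", "delicious"),
    Tagging("http://example.org/page", "spam", "carol", "bibsonomy", -1),
]
print(net_votes(sample))
```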
Limitations of tagging:
• Ambiguity of tags (example: is apple a fruit or the computer company?)
• Lack of synonym control (example: lorry or truck)
• Discrepancies in granularity (example: java vs programming language)
• Flat organisation of folksonomies
How do we overcome these?
• Use: CommonTag, MOAT, SCOT
CommonTag
• Adds concepts to tags from databases such as Freebase and DBpedia
Source: CommonTag
Meaning Of A Tag (MOAT)
• An ontology to represent how different meanings (URIs of Semantic Web resources) can be related to a tag
• Extends the Tag class from Richard Newman’s tag ontology
• Tagging (User, Resource, Tag, Meaning)
• Architecture of the MOAT Framework:
- MOAT server stores different meanings that can be queried
- MOAT client interacts with the server to let users easily annotate their content
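The server and client roles described above can be mimicked with a dictionary lookup. The Python sketch below is only an analogy for the Tagging (User, Resource, Tag, Meaning) quadruple: the `MEANINGS` table stands in for the MOAT server, the function names are hypothetical, and the DBpedia URIs are used purely as illustrative meaning identifiers. It does not use or reproduce the real MOAT API.

```python
# Stand-in for the meanings a MOAT server would store (illustrative values).
MEANINGS = {
    "apple": [
        "http://dbpedia.org/resource/Apple",
        "http://dbpedia.org/resource/Apple_Inc.",
    ],
}


def candidate_meanings(tag):
    """Return the meaning URIs known for a tag (empty list if none)."""
    return MEANINGS.get(tag.lower(), [])


def annotate(user, resource, tag, meaning):
    """Build a (user, resource, tag, meaning) quadruple for a chosen meaning."""
    if meaning not in candidate_meanings(tag):
        raise ValueError(f"{meaning!r} is not a known meaning of {tag!r}")
    return (user, resource, tag, meaning)


choices = candidate_meanings("apple")
print(annotate("alice", "http://example.org/post/1", "apple", choices[1]))
```

The extra lookup and choice made here is the additional step for the tagger that comes with attaching meanings to tags.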
Social Semantic Cloud of Tags (SCOT)
• An ontology that aims to represent sets of tags
• Built on top of Richard Newman’s Tag Ontology
Source: SCOT: Let's Share Tags!
Limitations of the previous ontologies:
• An extra step is added to the tagging activity
• Isn’t it daunting for the user to be presented with a list of meanings to choose from?
• Which meaning should the user choose?
• Will tagging remain popular with this additional step?
• If an automatic process is used to select the meaning of a tag, how accurate can this process be?
• Can this process really understand the user’s intent at that instant?
• With this additional meaning, isn’t tagging becoming another “strict” classification scheme?
• Can relationships between tags really be built on meanings?
• How about using some form of algorithm that can unfold new relationships between tags?
Fast Unfolding of Communities in Large Networks
A recursive method to extract the community structure of large networks
This method is based on modularity optimisation
The modularity is a scalar value that measures the density of links inside communities as compared to links between communities
It unfolds a complete hierarchical community structure for large networks in a short time
Results have shown that on a network of 118 million nodes, the algorithm took 152 minutes
Source: Blondel V.D. et al. 2008. Fast unfolding of communities in large networks
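To make the modularity measure concrete, the Python sketch below computes it for a given partition of a small weighted, undirected graph, using the standard form Q = sum over communities c of [ L_c / m - (d_c / 2m)^2 ], where m is the total edge weight, L_c the weight of edges inside c, and d_c the sum of the degrees of the nodes in c. The toy graph (two triangles joined by one edge) and all names are illustrative; this only scores a partition and does not perform the optimisation itself.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition.

    edges: iterable of (u, v, weight) for an undirected graph
    community: dict mapping each node to its community label
    """
    m = 0.0
    degree = defaultdict(float)
    inside = defaultdict(float)  # weight of edges with both ends in the community
    for u, v, w in edges:
        m += w
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    total = defaultdict(float)  # sum of node degrees per community
    for node, k in degree.items():
        total[community[node]] += k
    return sum(inside[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
EDGES = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
PARTITION = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}

print(round(modularity(EDGES, PARTITION), 4))  # 0.3571, i.e. 5/14
```

Putting all six nodes into a single community gives a modularity of 0, so the two-triangle split scores higher, which matches the intuition of dense links inside communities and sparse links between them.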
The algorithm consists of two phases which are iterated until a maximum modularity is attained.
First, all nodes are assigned to different communities.
Then each node is compared with its neighbours. The node is placed in the community which yields a maximum gain in modularity.
This process is repeated for all nodes until no further movement can be attained.
The second phase consists of building a network whose nodes are now the communities found during the first phase.
After the second phase, the process starts again with the first phase.
A “pass” denotes a combination of both phases.
The passes are iterated until there are no more changes and the maximum modularity is reached for the whole network.
The height of the hierarchy is determined by the number of passes.
At the end, a hierarchical structure is attained that consists of communities of communities.
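The second phase described above, building a network whose nodes are the communities found in the first phase, can be sketched in a few lines of Python. The sketch covers only this aggregation step, not the modularity-driven node moves of the first phase. Edge weights between two communities are summed, and edges inside a community become a self-loop on the new node; the function name and toy graph are illustrative.

```python
from collections import defaultdict


def aggregate(edges, community):
    """Collapse each community into one node and sum the edge weights.

    edges: iterable of (u, v, weight) for an undirected graph
    community: dict mapping each node to its community label
    Returns a dict {(community_i, community_j): weight} with i <= j;
    an entry with i == j is the self-loop holding the internal weight.
    """
    weights = defaultdict(float)
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        weights[key] += w
    return dict(weights)


# Two triangles joined by one edge, already split into communities A and B.
EDGES = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
PARTITION = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}

print(aggregate(EDGES, PARTITION))
```

Running the first phase again on the aggregated network and repeating this step is what produces the communities of communities.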
State of the Art Tool
The Data
• It is provided beforehand
• Consists of a hierarchical structure made up of communities of communities of related tags
• This hierarchical structure is constructed using the “Fast Unfolding of Communities in Large Networks” algorithm
• The tags are from the social bookmarking website Bibsonomy (http://www.bibsonomy.org/)
• The aim of using the community structure algorithm is to unfold new relationships amongst tags
• A visualisation of the tagging graph that depicts the relationships amongst tags
The input to the system consists of edge lists
Each edge list file corresponds to one level of the hierarchy (the lower level or one pass of the algorithm)
Four edge list files were used for this system:
• The first list is a plain list of related tags queried from Bibsonomy
• The other three lists denote communities, or communities of communities, computed by the community structure algorithm
Each relation (line) in an edge list file is structured as follows:
• The first edge list: <tag_i, tag_j, weight>
• The other three edge lists:
- <community_i, tag_j, weight> or
- <community_i, community_j, weight>
The edge list files contain:
• The first (lower level): 13126 nodes with 264718 edges
• The second (first pass): 529 nodes with 6337 edges
• The third (second pass): 65 nodes with 374 edges
• The fourth (third pass): 50 nodes with 207 edges
A sample from one of the edge lists (the lower level file)
caching,offlinebrowser,2.0
caching,archiving,2.0
institutions,activity,1.0
malian,senegal,2.0
malian,northern,2.0
malian,guinea,2.0
malian,drummers,2.0
cdf,c,1.0
cdf,library,1.0
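Reading such a file takes only a few lines. The Python sketch below parses the sample above into (source, target, weight) tuples and counts the distinct nodes and edges, the same kind of figures reported for the four files. It assumes comma-separated lines with exactly three fields; the function names are illustrative, and the tool itself is written in Java.

```python
import csv
import io

SAMPLE = """caching,offlinebrowser,2.0
caching,archiving,2.0
institutions,activity,1.0
malian,senegal,2.0
malian,northern,2.0
malian,guinea,2.0
malian,drummers,2.0
cdf,c,1.0
cdf,library,1.0
"""


def read_edge_list(text):
    """Parse 'source,target,weight' lines into (str, str, float) tuples."""
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        source, target, weight = row
        edges.append((source.strip(), target.strip(), float(weight)))
    return edges


def neighbours(edges, node):
    """Return the sorted targets linked from the given source node."""
    return sorted(target for source, target, _ in edges if source == node)


edges = read_edge_list(SAMPLE)
nodes = {name for source, target, _ in edges for name in (source, target)}
print(len(nodes), "nodes,", len(edges), "edges")
print(neighbours(edges, "malian"))
```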
First Task: To semantically represent all edge lists that make up the hierarchical structure
Since the lower level edge list is made up of a set of tags, the tags are described using the SCOT ontology
To represent the hierarchical structure of communities, however, a new ontology must be designed; it is built on top of SCOT and linked to it
The Community Structure Ontology
[Figure: the Community Structure Ontology. Labels in the diagram: CommunityStructure, Community, CommunityAggregation, Linkage, scot:Tag, sioc:Resource, name, linkWeight, modularity, pass, UnfoldedCommunity, UnfoldingActivity, linkedWith, linkedTag, associatedCommunity, communityOf]
The ontology was designed with Protégé, a Java application for designing ontologies
The ontology is built on OWL 2
Classes: CommunityStructure, Community, CommunityAggregation, Linkage
Object properties: associatedCommunity, communityOf, linkedIn, linkedTag, linkedWith, unfoldedCommunity, unfoldingActivity
Data properties: communityName, linkWeight, modularity, pass
Second Task: To create an application that transforms the edge lists into RDF/XML statements and stores the documents on physical storage. A query engine is also included in the application to query the RDF/XML statements.
The application is developed in the Java programming language.
To create RDF/XML statements and write them to physical storage, a widely used API, the Jena API, is embedded in the system.
Jena – A Semantic Web Framework
• Developed by HP
• An RDF API for reading and writing RDF models in RDF/XML
• An OWL API for reading and writing OWL ontologies
• In-memory and persistent storage for writing RDF/XML statements to memory or to physical storage such as text files or even relational databases
• SPARQL query engine
The Tool
The tool provides the following features:
• Properties to set up:
- The Edge List Directory
- The Edge List File Structure
• Settings for the type of storage required:
- RDF/XML documents
- Relational database persistent storage
- TDB storage, a custom fast persistent store
• Properties to set up the ontologies
The method to transform the edge lists into RDF statements:
• First, the edge lists are merged together and ordered according to their hierarchical structure
• Second, the RDF Model consisting of RDF statements is created according to the Community Structure and SCOT ontologies
• Third, the RDF statements are written according to the configured storage settings.
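The three steps above can be outlined in Python as a conversion from ordered edge lists to N-Triples lines. This is a sketch under loud assumptions: both namespaces are hypothetical placeholders, and the way each edge is modelled (a Linkage node carrying linkedTag and linkWeight, attached to the source via linkedWith) is a guess based only on the class and property names listed for the ontology, not the tool's actual mapping. The tool itself uses Jena in Java.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
CS = "http://example.org/cs#"      # hypothetical ontology namespace
DATA = "http://example.org/data/"  # hypothetical namespace for tags and communities


def uri(name):
    """Build an N-Triples URI term for a tag or community name."""
    return f"<{DATA}{quote(name, safe='')}>"


def edge_to_ntriples(level, index, source, target, weight):
    """Describe one edge as a Linkage node (assumed modelling, see lead-in)."""
    linkage = f"<{DATA}linkage/{level}/{index}>"
    return [
        f"{linkage} <{RDF_TYPE}> <{CS}Linkage> .",
        f"{uri(source)} <{CS}linkedWith> {linkage} .",
        f"{linkage} <{CS}linkedTag> {uri(target)} .",
        f'{linkage} <{CS}linkWeight> "{weight}"^^<{XSD_DOUBLE}> .',
    ]


def edge_lists_to_ntriples(edge_lists):
    """edge_lists: lists of (source, target, weight), ordered from the lowest level up."""
    lines = []
    for level, edges in enumerate(edge_lists):
        for index, (source, target, weight) in enumerate(edges):
            lines.extend(edge_to_ntriples(level, index, source, target, weight))
    return lines


for line in edge_lists_to_ntriples([[("malian", "senegal", 2.0)]]):
    print(line)
```

Writing the resulting lines to a file corresponds to the whole-document option, while sending each statement to a store as soon as it is produced corresponds to the on-the-fly persistent options.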
Writing of RDF Statements
• RDF Documents:
- For whole documents: the whole document is written after the whole model is created
- For split documents: a document is written after the model for each community is created.
- Two index lists are created: one alphabetical (A-Z) and another that indicates where each community document is located
• RDF Persistent Storage:
- RDB method: MySQL is used as the persistent relational database, and RDF statements are written on the fly, i.e. each statement is written to the database as soon as it is created
- TDB method: each statement is likewise written on the fly
Writing of RDF Statements (Results)
Storage Method                          Duration
RDF Documents (Whole Documents)         1.9 minutes
RDF Documents (Split Documents)         2.2 minutes
RDF Persistent Storage (RDB Method)     35.2 minutes
RDF Persistent Storage (TDB Method)     3.2 minutes
Querying Statements
• For RDF documents, the Corese SPARQL engine was used
- The Corese SPARQL engine is developed by Edelweiss
- Built on top of Jena with some added enhancements such as approximate searches and select expressions
- Queries only RDF documents and cannot query relational databases directly
• For persistent storage, the Jena SPARQL engine is used, since Jena allows for direct querying
Querying Methods
• RDF Documents (Split Documents):
- First, query the index lists
- Get the community document
- Query the community document and get the linked communities
- Query the index list and query the contents of each linked community
• RDF Documents (Whole Documents):
- Query the whole model for the community
- Retrieve the linked communities
- Query the linked communities for their content
• Persistent Storage:
- Query the whole model for the community
- Retrieve the linked communities
- Query the linked communities for their content
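The whole-model querying steps listed above can be outlined over an in-memory list of triples. In the Python sketch below, the predicate names `linkedWith` and `linkedTag` are borrowed from the ontology's property list, but how they connect communities and tags here, as well as the sample data, are assumptions for illustration; the tool itself issues SPARQL queries through Corese or Jena.

```python
def linked(triples, subject, predicate):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def community_contents(triples, community):
    """Find the communities linked with `community`, then each one's tags."""
    contents = {}
    for other in linked(triples, community, "linkedWith"):
        contents[other] = linked(triples, other, "linkedTag")
    return contents


# Hypothetical data: one community linked with two others.
TRIPLES = [
    ("malian", "linkedWith", "community1"),
    ("malian", "linkedWith", "community2"),
    ("community1", "linkedTag", "senegal"),
    ("community1", "linkedTag", "guinea"),
    ("community2", "linkedTag", "drummers"),
]

print(community_contents(TRIPLES, "malian"))
```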
Querying Statements (Results)
• Results are based on a community called malian
• This community has 57 linked communities and 15 linked tags
Storage Method                          Duration
RDF Documents (Whole Documents)         Out of memory
RDF Documents (Split Documents)         43.3 minutes
RDF Persistent Storage (RDB Method)     7 seconds
RDF Persistent Storage (TDB Method)     1 second
Other features
• RDF Document Viewer
Conclusion
In this research we have seen the importance of the Semantic Web and of describing Web data semantically
We have seen the importance of using folksonomies for search and exploration
Additionally, we have seen various ontologies that show how such folksonomies can be represented semantically
Using community structure algorithms and graph mining techniques, new relationships amongst tags can be unfolded
An ontology was designed and developed to represent the community structure produced by the “fast unfolding of communities in large networks” algorithm
From this ontology, RDF/XML statements can be created and linked to the SCOT ontology
We have seen that persistent triple stores make querying triple statements much faster than querying RDF documents
Future Enhancements
To try this model on larger tag models from different websites
To include the tagger and links to the actual resource
To analyse how these links contribute to the linked data initiative
To optimise writing and querying for larger models