Background Knowledge for Ontology Construction
Blaž Fortuna, Marko Grobelnik, Dunja Mladenić
Institute Jožef Stefan, Slovenia
Bag-of-words
• Documents are encoded as vectors.
• Each element of the vector corresponds to the frequency of one word.
• Each word can also be weighted according to its importance.
• There are various ways of selecting word weights. In our paper we propose a method to learn them!
Example document:
Computers are used in increasingly diverse ways in Mathematics and the Physical and Life Sciences. This workshop aims to bring together researchers in Mathematics, Computer Science, and Sciences to explore the links between their disciplines and to encourage new collaborations.

Word          Frequency   Word weight   Weighted frequency
computer      2           0.9           1.8
mathematics   2           0.8           1.6
are           1           0.01          0.01
and           4           0.01          0.04
science       3           0.9           2.7
…             …           …             …

Word weights separate important words (computer, mathematics, science) from noise (are, and).
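The weighting in the example can be reproduced with a short sketch. The tokeniser and the `NORMALISE` map are assumptions made for illustration (a hand-written stand-in for a real stemmer); the weights are the ones shown on the slide.

```python
import re
from collections import Counter

text = (
    "Computers are used in increasingly diverse ways in Mathematics and the "
    "Physical and Life Sciences. This workshop aims to bring together "
    "researchers in Mathematics, Computer Science, and Sciences to explore "
    "the links between their disciplines and to encourage new collaborations."
)

# Hypothetical normalisation map standing in for a real stemmer.
NORMALISE = {"computers": "computer", "sciences": "science"}

def term_frequencies(document):
    """Count lower-cased, normalised word occurrences (bag-of-words)."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(NORMALISE.get(token, token) for token in tokens)

# Word weights taken from the slide.
weights = {"computer": 0.9, "mathematics": 0.8, "are": 0.01, "and": 0.01, "science": 0.9}

tf = term_frequencies(text)
# Weighted frequency = term frequency * word weight.
weighted = {word: tf[word] * weight for word, weight in weights.items()}

for word in weights:
    print(f"{word:12s} {tf[word]:2d} {weighted[word]:.2f}")
```

Words with low weights contribute almost nothing to the document vector, however often they occur.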
SVM feature selection
Input:
• Set of documents
• Set of categories
• Each document is assigned a subset of categories
Output:
• Ranking of words according to importance
Intuition:
• A word is important if it discriminates documents according to categories.
Basic algorithm:
• Learn a linear SVM classifier for each of the categories.
• A word is important if it is important for classification into any of the categories.
Reference:
• Brank J., Grobelnik M., Milic-Frayling N. & Mladenic D. Feature selection using support vector machines.
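The basic algorithm can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the linear SVM is trained with plain batch sub-gradient descent on the hinge loss (no bias term), and "important for any of the categories" is read as the maximum absolute normal-vector component over all categories. The toy data, function names, and hyperparameters are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Batch sub-gradient descent on the L2-regularised hinge loss (no bias)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:  # document violates the margin
                for i, xi in enumerate(x):
                    grad[i] -= label * xi / len(X)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def rank_words(vocab, X, doc_categories):
    """Rank words by their largest |normal component| over all categories."""
    categories = sorted(set().union(*doc_categories))
    importance = [0.0] * len(vocab)
    for cat in categories:
        # One-vs-rest labels: +1 if the document belongs to the category.
        y = [1 if cat in cats else -1 for cats in doc_categories]
        normal = train_linear_svm(X, y)
        importance = [max(imp, abs(n)) for imp, n in zip(importance, normal)]
    return sorted(zip(vocab, importance), key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): term frequencies for three words in four documents.
vocab = ["computer", "bank", "and"]
X = [[2, 0, 3], [3, 0, 3], [0, 2, 3], [0, 3, 3]]
doc_categories = [{"science"}, {"science"}, {"finance"}, {"finance"}]

ranking = rank_words(vocab, X, doc_categories)
for word, score in ranking:
    print(f"{word:10s} {score:.3f}")
```

The word "and" occurs equally often in every document, so it does not help to discriminate the categories and ends up at the bottom of the ranking.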
Word weight learning
• The word weight learning method is based on SVM feature selection.
• Besides ranking the words, it also assigns them weights based on the SVM classifier.
Algorithm:
1. Calculate a linear SVM classifier for each category.
2. Calculate word weights for each category from the SVM normal vectors. The weight for the i-th word and j-th category is:
\[ w^j_i = \frac{1}{N} \sum_{k=1}^{N} \left| x_{k,i} \cdot n^j_i \right| \]
3. Final word weights are calculated separately for each document:
\[ x_{k,i} = \mathrm{TF}_{k,i} \cdot \sum_{j \in C(x_k)} w^j_i \]
Notation:
• N – number of documents
• {x_1, …, x_N} – documents
• C(x_i) – set of categories for document x_i
• n – number of words
• {w_1, …, w_n} – word weights
• {n^j_1, …, n^j_n} – SVM normal vector for the j-th category
• TF_{k,i} – frequency of the i-th word in the k-th document
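A minimal sketch of steps 2 and 3, assuming the SVM normal vectors from step 1 are already available. The normal vectors, term frequencies, and function names below are hypothetical, and the sketch reads the per-category weight as the average of |x_{k,i} · n^j_i| over all documents, with the raw term frequencies used as x.

```python
def category_word_weights(docs, normals):
    """Step 2: w[j][i] = (1/N) * sum over documents k of |x[k][i] * n[j][i]|."""
    n_docs = len(docs)
    return {
        cat: [sum(abs(doc[i] * normal[i]) for doc in docs) / n_docs
              for i in range(len(normal))]
        for cat, normal in normals.items()
    }

def reweight_document(tf, categories, weights):
    """Step 3: scale each term frequency by the summed weights of the
    categories the document belongs to."""
    return [tf_i * sum(weights[cat][i] for cat in categories)
            for i, tf_i in enumerate(tf)]

# Hypothetical toy data: vocabulary is ["computer", "and", "bank"].
docs = [[2, 4, 0], [0, 2, 3]]                 # term frequencies
doc_categories = [{"science"}, {"finance"}]   # C(x_k)
normals = {                                   # assumed SVM normal vectors
    "science": [1.0, 0.0, -0.5],
    "finance": [-0.5, 0.0, 1.0],
}

weights = category_word_weights(docs, normals)
reweighted = [reweight_document(tf, cats, weights)
              for tf, cats in zip(docs, doc_categories)]
print(weights)
print(reweighted)
```

Because the weights depend on the categories of each document, the same word can end up with a different weight in different documents.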
OntoGen system
• System for semi-automatic ontology construction
– Why semi-automatic? The system only gives suggestions to the user; the user always makes the final decision.
• The system is data-driven and can scale to large collections of documents.
• The current version focuses on the construction of topic ontologies; the next version will be able to deal with more general ontologies.
• Can import/export RDF.
• There is a big divide between unsupervised and fully supervised construction tools.
• Both approaches have weak points:
– it is difficult to obtain desired results using unsupervised methods, e.g. limited background knowledge
– manual tools (e.g. Protégé, OntoStudio) are time-consuming, and the user needs to know the entire domain.
• We combined these two approaches in order to eliminate these weaknesses:
– the user guides the construction process,
– the system helps the user with suggestions based on the document collection.
http://kt.ijs.si/blazf/examples/ontogen.html
[Figure labels: Context, Topic]
How does OntoGen help?
By identifying the topics and relations between them:
… using k-means clustering:
• cluster of documents => topic
• documents are assigned to clusters => ‘subject-of’ relation
• We can repeat clustering on a subset of documents assigned to a specific topic => identifies subtopics and ‘subtopic-of’ relation
By naming the topics:
… using centroid vector:
• A centroid vector of a given topic is the average document from this topic (normalised sum of the topic’s documents).
• The most descriptive keywords for a given topic are the words with the highest weights in the centroid vector.
… using linear SVM classifier:
• An SVM classifier is trained to separate documents of the given topic from the other documents in the context.
• Words that are found most important for the classification are selected as keywords for the topic.
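The clustering and centroid-based naming steps can be sketched as follows. This is an illustration only, not OntoGen's code: it assumes cosine similarity between normalised document vectors and seeds the clusters with the first k documents; the vocabulary, documents, and function names are hypothetical.

```python
import math

def normalise(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def centroid(docs):
    """Normalised sum of a topic's documents."""
    return normalise([sum(column) for column in zip(*docs)])

def kmeans(docs, k, iterations=10):
    """Plain k-means with cosine similarity; the first k documents are seeds."""
    docs = [normalise(d) for d in docs]
    centroids = docs[:k]
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the most similar centroid.
        assignment = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(d, centroids[c])))
            for d in docs
        ]
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

def keywords(centroid_vector, vocab, top=2):
    """Words with the highest weights in the centroid vector."""
    ranked = sorted(zip(vocab, centroid_vector), key=lambda p: p[1], reverse=True)
    return [word for word, _ in ranked[:top]]

# Hypothetical toy collection.
vocab = ["stock", "market", "gene", "protein"]
docs = [[3, 2, 0, 0], [0, 0, 2, 3], [2, 3, 0, 1], [0, 1, 3, 2]]

assignment, centroids = kmeans(docs, k=2)
for topic in range(2):
    print(topic, keywords(centroids[topic], vocab))
```

Running `kmeans` again on only the documents assigned to one cluster would give candidate subtopics for that topic.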
[Screenshots: Suggestions of subtopics; Topic ontology visualization (labels: Topic ontology, Topic keywords, All documents, Selected topic); Outlier detection (label: Topic document); Topic ontology of Yahoo! Finances]
Background knowledge in OntoGen
• All of the methods in OntoGen are based on the bag-of-words representation.
• By using different word weights we can tune these methods according to the user’s needs.
• The user needs to group the documents into categories. This can be done efficiently using active learning.
Influence of background knowledge
• Data: Reuters news articles
• Each news article is assigned two different sets of tags:
– Topics
– Countries
• Each set of tags offers a different view on the data
[Figure: the same documents shown in the Topics view and in the Countries view]
Links
• OntoGen: http://ontogen.ijs.si/
• Text Garden: http://www.textmining.net/