Ontology-based Annotation
Sergey Sosnovsky @PAWS@SIS@Pitt
Why Do We Need Annotation
• Annotation-based Services
    • Integration of Dispersed Information (knowledge-based linking)
    • Better Indexing and Retrieval (based on the document semantics)
    • Content-based Adaptation (modeling document content in terms of the domain model)
• Knowledge Management
    • Organization’s Repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
• Collaboration Support
    • Knowledge sharing and communication
What is Added by O-based Annotation
• Ontology-driven processing (effective formal reasoning)
• Connecting other O-based Services (O-mapping, O-visualization, …)
• Unified vocabulary
• Connecting to the rest of SW knowledge
Definition
O-based Annotation is a process of creating a mark-up of Web documents using a pre-existing ontology and/or populating knowledge bases with marked-up documents.
Example: “Michael Jordan plays basketball”
• Ontology: our:Athlete our:plays our:Sports
• Michael Jordan rdf:type our:Athlete
• Basketball rdf:type our:Sports
• Michael Jordan our:plays Basketball
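The example sentence can be written down as a handful of subject-predicate-object triples. The sketch below is only an illustration: it uses plain Python tuples instead of an RDF library, and the instance identifiers `our:MichaelJordan` and `our:Basketball` as well as the helpers `types_of` and `conforms` are assumptions introduced here, not part of any tool from the list that follows.

```python
# Prefixes mirror the slide; "our:" stands for the annotating ontology's namespace.
RDF_TYPE = "rdf:type"

# Schema-level statement taken from the ontology (encoded here as a plain tuple).
schema = [("our:Athlete", "our:plays", "our:Sports")]

# Instance-level statements produced by annotating the sentence.
# The instance identifiers are hypothetical names chosen for this sketch.
annotations = [
    ("our:MichaelJordan", RDF_TYPE, "our:Athlete"),
    ("our:Basketball", RDF_TYPE, "our:Sports"),
    ("our:MichaelJordan", "our:plays", "our:Basketball"),
]


def types_of(entity, triples):
    """Return the set of concepts an entity is typed with."""
    return {o for s, p, o in triples if s == entity and p == RDF_TYPE}


def conforms(statement, triples, schema):
    """Check that an instance statement matches some schema-level triple."""
    s, p, o = statement
    return any(
        p == sp and sc in types_of(s, triples) and oc in types_of(o, triples)
        for sc, sp, oc in schema
    )
```

Here `conforms(annotations[2], annotations, schema)` holds because Michael Jordan is typed as an athlete and Basketball as a sport, which is the kind of check that ontology-driven processing makes possible.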
List of Tools
AeroDAML / AeroSWARM, Annotea / Annozilla, Armadillo, AktiveDoc, COHSE, GOA, KIM Semantic Annotation Platform, MagPie, Melita, MnM, OntoAnnotate, Ontobroker, OntoGloss, ONTO-H, Ont-O-Mat / S-CREAM / CREAM, Ontoseek, Pankow, SHOE Knowledge Annotator, Seeker, Semantik, SemTag, SMORE, Yawas, …
Information Extraction Tools:
• Alembic
• Amilcare / T-REX
• Annie
• Fastus
• Lasie
• Proteus
• SIFT
• …
Important Characteristics
• Automation of Annotation (manual / semiautomatic / automatic / editable)
• Ontology-related issues:
    • pluggable ontology (yes / no)
    • ontology language (RDFS / DAML+OIL / OWL / …)
    • local / anywhere access
    • ontology elements available for annotation (concepts / instances / relations / triples)
    • where annotations are stored (in the annotated document / on a dedicated server / where specified)
    • annotation format (XML / RDF / OWL / …)
• Annotated Documents:
    • document kinds (text / multimedia)
    • document formats (plain text / HTML / PDF / …)
    • document access (local / web)
• Architecture / Interface / Interoperability: standalone tool / web interface / web component / API / …
• Annotation Scale (large: the size of the WWW / small: a hundred documents)
• Existing Documentation / Tutorial Availability
SMORE
• Manual Annotation
• OWL-based Markup
• Simultaneous O modification (if necessary)
• ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
• Post-annotation O-based Inference
Problems of Manual Annotation
• Expensive / Time-consuming
• Difficult / Error-prone
• Subjective (two people annotating the same documents annotate them differently in 15–30% of cases)
• Never-ending:
    • new documents
    • new versions of ontologies
• Annotation storage problem: where?
• Trust in the owner’s annotation:
    • incompetence
    • spam (Google does not use <META> info)
Solution: Dedicated Automatic Annotation Services (“Search Engine”-like)
Automatic O-based Annotation
• Supervised: MnM, S-CREAM, Melita & AktiveDoc
• Unsupervised: SemTag - Seeker, Armadillo, AeroSWARM
MnM
• Ontology-based Annotation Interface:
    • Ontology browser (rich navigation capabilities)
    • Document browser (usually a Web browser)
    • Annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
• A built-in or external ML component classifies the main corpus of documents
• Activity Flow:
    • Markup (a human user manually annotates a training set of documents with ontology elements)
    • Learn (a learning algorithm is run over the marked-up corpus to learn the extraction rules)
    • Extract (an IE mechanism is selected and run over a set of documents)
    • Review (a human user inspects the results and corrects them if necessary)
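The four-stage activity flow can be illustrated with a toy pipeline. This is a sketch under strong assumptions: the "learner" is a simple phrase-to-concept lookup table standing in for a real IE component that induces extraction rules, and the function names `learn`, `extract`, and `review` are hypothetical, not MnM's API.

```python
from collections import Counter, defaultdict


def learn(marked_up):
    """Learn: map each annotated phrase to the concept it was most often tagged with."""
    votes = defaultdict(Counter)
    for _text, spans in marked_up:
        for phrase, concept in spans:
            votes[phrase.lower()][concept] += 1
    return {phrase: counts.most_common(1)[0][0] for phrase, counts in votes.items()}


def extract(rules, document):
    """Extract: propose (phrase, concept) pairs for every learned phrase found in the text."""
    lowered = document.lower()
    return [(phrase, concept) for phrase, concept in sorted(rules.items()) if phrase in lowered]


def review(proposals, rejected):
    """Review: drop the proposals that the human reviewer marked as wrong."""
    return [p for p in proposals if p not in rejected]


# Markup: a human annotates a small training set with ontology concepts.
training = [
    (
        "Michael Jordan plays basketball",
        [("Michael Jordan", "our:Athlete"), ("basketball", "our:Sports")],
    ),
]

rules = learn(training)
proposals = extract(rules, "Yesterday Michael Jordan watched tennis and basketball.")
final = review(proposals, rejected=set())
```

The lookup table only recognizes phrases it has already seen, which is exactly the limitation that a real rule-learning IE component is meant to overcome.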
Amilcare and T-REX
• Amilcare: automatic IE component
• Used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, Ont-O-Mat, Semantik)
• Released to about 50 industrial and academic sites
• Java API
• Recently succeeded by T-REX
Pankow
• Input: a web page.
• Step 1: The web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger is used to find candidate proper nouns). Result 1: a set of candidate proper nouns.
• Step 2: The system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns. Result 2: a set of hypothesis phrases.
• Step 3: Google is queried for the hypothesis phrases. Result 3: the number of hits for each hypothesis phrase.
• Step 4: The system sums up the query results to a total for each instance-concept pair, then assigns each candidate proper noun to its highest-ranked concept. Result 4: an ontologically annotated web page.
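Steps 2 to 4 of the Pankow procedure can be sketched as follows. Everything concrete here is an assumption made for illustration: the three patterns in `PATTERNS`, the naive pluralization, and the `fake_hit_count` function with its made-up numbers, which stands in for a real search-engine query.

```python
# Illustrative linguistic patterns; the actual preset patterns are not listed on the slide.
PATTERNS = [
    "{concept}s such as {noun}",  # naive plural, good enough for a sketch
    "{noun} is a {concept}",
    "the {noun} {concept}",
]


def hypothesis_phrases(noun, concept):
    """Step 2: instantiate every pattern for one instance-concept pair."""
    return [p.format(noun=noun, concept=concept) for p in PATTERNS]


def categorize(nouns, concepts, hit_count):
    """Steps 3 and 4: sum hit counts per pair and pick the highest-ranked concept."""
    result = {}
    for noun in nouns:
        totals = {
            concept: sum(hit_count(ph) for ph in hypothesis_phrases(noun, concept))
            for concept in concepts
        }
        result[noun] = max(totals, key=totals.get)
    return result


# Made-up hit counts standing in for search-engine results.
FAKE_HITS = {
    "hotels such as Hilton": 120,
    "Hilton is a hotel": 40,
    "Hilton is a city": 3,
}


def fake_hit_count(phrase):
    return FAKE_HITS.get(phrase, 0)


labels = categorize(["Hilton"], ["hotel", "city"], fake_hit_count)
```

With these stand-in counts, "Hilton" totals 160 hits for `hotel` and 3 for `city`, so it is assigned to `hotel`.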
SemTag - Seeker
• IBM-developed
• ~264 million web pages
• ~72 thousand concepts (TAP taxonomy)
• 434 million automatically disambiguated semantic tags
Spotting pass
• Documents are retrieved from the Seeker store and tokenized
• Tokens are matched against the TAP concepts
• Each resulting label is saved with ten words to either side as a “window” of context around the particular candidate object
Learning pass
• A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy
• The TBD (taxonomy-based disambiguation) algorithm is used
Tagging pass
• “Windows” are scanned once more to disambiguate each reference and determine a TAP object
• A record is entered into a database of final results containing the URL, the reference, and any other associated metadata
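The spotting pass can be sketched as below: tokenize a document, match token runs against known labels, and keep ten words of context on either side. This is an illustration only; the `labels` argument is a hypothetical stand-in for the TAP concept labels, the regular-expression tokenizer is an assumption, and the TBD disambiguation step is not shown.

```python
import re

WINDOW = 10  # words kept on either side of a spotted label, as on the slide


def spot(text, labels):
    """Return (label, start_index, window) for each token run matching a known label."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    spots = []
    for label in labels:
        parts = label.lower().split()
        n = len(parts)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == parts:
                left = tokens[max(0, i - WINDOW):i]
                right = tokens[i + n:i + n + WINDOW]
                spots.append((label, i, left + tokens[i:i + n] + right))
    return spots


spots = spot("The Jaguar dealership sold a new car today", ["Jaguar"])
```

Each saved window is what a later disambiguation step would inspect to decide, for example, whether "Jaguar" refers to the car maker or the animal.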
Conclusions
• Web-document annotation is a necessary thing
• O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
• Manual annotation is a bad thing
• Automatic annotation is a good thing:
    • Supervised O-based annotation:
        • Useful O-based interface for annotating the training set
        • Traditional IE tools for textual classification
    • Unsupervised O-based annotation:
        • COHSE: matches concept names from the ontology and a thesaurus against tokens from the text
        • Pankow: uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
        • SemTag: uses concept names to match tokens, and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment