ontology-based annotation sergey sosnovsky @paws@sis@pitt

Ontology-based Annotation

Sergey Sosnovsky@PAWS@SIS@PITT

Outline

• O-based Annotation
• Conclusion
• Questions

Why Do We Need Annotation

Annotation-based Services
• Integration of dispersed information (knowledge-based linking)
• Better indexing and retrieval (based on the document semantics)
• Content-based adaptation (modeling document content in terms of the domain model)

Knowledge Management
• Organizations' repositories as mini Webs (Boeing, Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)

Collaboration Support
• Knowledge sharing and communication

What is Added by O-based Annotation
• Ontology-driven processing (effective formal reasoning)
• Connecting other O-based services (O-mapping, O-visualization, …)
• Unified vocabulary
• Connecting to the rest of SW knowledge

Definition

O-based Annotation is a process of
• creating a mark-up of Web documents using a pre-existing ontology
and/or
• populating knowledge bases with marked-up documents

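[Figure: RDF graph for the sentence “Michael Jordan plays basketball”. Class level: our:Athlete our:plays our:Sports. Instance level: Michael Jordan our:plays Basketball. Typing: Michael Jordan rdf:type our:Athlete; Basketball rdf:type our:Sports.]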
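The graph in this example can be written down as subject-predicate-object triples. The following is a minimal sketch in plain Python, not the output of any of the tools discussed. It assumes that the "our:" prefix stands for the presenter's example ontology; the instance identifiers our:MichaelJordan and our:Basketball are hypothetical names chosen for illustration.

```python
# Triples read off the example figure. Instance identifiers are illustrative.
TRIPLES = [
    # Class-level edge as drawn in the figure; in RDF Schema this would
    # normally be stated with rdfs:domain and rdfs:range on our:plays.
    ("our:Athlete", "our:plays", "our:Sports"),
    # Instance typing.
    ("our:MichaelJordan", "rdf:type", "our:Athlete"),
    ("our:Basketball", "rdf:type", "our:Sports"),
    # The fact expressed by the annotated sentence.
    ("our:MichaelJordan", "our:plays", "our:Basketball"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def types_of(resource, triples=TRIPLES):
    """Return the classes a resource is declared to be an instance of."""
    return objects(resource, "rdf:type", triples)


if __name__ == "__main__":
    print(types_of("our:MichaelJordan"))
    print(objects("our:MichaelJordan", "our:plays"))
```

Storing the annotation as triples is what lets a knowledge base answer questions such as "what does this athlete play?" without re-reading the text.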

List of Tools

• AeroDAML / AeroSWARM
• Annotea / Annozilla
• Armadillo
• AktiveDoc
• COHSE
• GOA
• KIM Semantic Annotation Platform
• MagPie
• Melita
• MnM
• OntoAnnotate
• Ontobroker
• OntoGloss
• ONTO-H
• Ont-O-Mat / S-CREAM / CREAM
• Ontoseek
• Pankow
• SHOE Knowledge Annotator
• Seeker
• Semantik
• SemTag
• SMORE
• Yawas
• …

Information Extraction Tools:
• Alembic
• Amilcare / T-REX
• Annie
• Fastus
• Lasie
• Proteus
• SIFT
• …

Important Characteristics

• Automation of annotation (manual / semiautomatic / automatic / editable)
• Ontology-related issues:
  • pluggable ontology (yes / no)
  • ontology language (RDFS / DAML+OIL / OWL / …)
  • ontology access (local / anywhere)
  • ontology elements available for annotation (concepts / instances / relations / triples)
  • where annotations are stored (in the annotated document / on a dedicated server / where specified)
  • annotation format (XML / RDF / OWL / …)
• Annotated documents:
  • document kinds (text / multimedia)
  • document formats (plain text / HTML / PDF / …)
  • document access (local / Web)
• Architecture / interface / interoperability: standalone tool / Web interface / Web component / API / …
• Annotation scale (large: the size of the WWW / small: a hundred documents)
• Existing documentation / tutorial availability

SMORE

• Manual annotation
• OWL-based markup
• Simultaneous ontology modification (if necessary)
• ScreenScraper mines metadata from annotated pages and suggests it as candidates for the mark-up
• Post-annotation O-based inference

[Figure: the “Michael Jordan plays basketball” example graph, with the labels our:Athlete, our:Sports, our:plays and rdf:type.]

Problems of Manual Annotation

• Expensive / time-consuming
• Difficult / error-prone
• Subjective (two people annotating the same documents annotate them differently in 15–30% of cases)
• Never-ending:
  • new documents
  • new versions of ontologies
• Annotation storage problem: where?
• Trust in the owner's annotation:
  • incompetence
  • spam (Google does not use <META> info)

Solution: dedicated automatic annotation services (“search engine”-like)

Automatic O-based Annotation

• Supervised:
  • MnM
  • S-CREAM
  • Melita & AktiveDoc
• Unsupervised:
  • SemTag - Seeker
  • Armadillo
  • AeroSWARM

MnM

• Ontology-based annotation interface:
  • ontology browser (rich navigation capabilities)
  • document browser (usually a Web browser)
  • annotation is mainly based on select-drag-and-drop association of text fragments with ontology elements
• A built-in or external ML component classifies the main corpus of documents
• Activity flow:
  • Markup (a human user manually annotates a training set of documents with ontology elements)
  • Learn (a learning algorithm is run over the marked-up corpus to learn the extraction rules)
  • Extract (an IE mechanism is selected and run over a set of documents)
  • Review (a human user inspects the results and corrects them if necessary)
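The Markup / Learn / Extract / Review flow above can be illustrated with a toy pipeline. This is only a sketch under strong assumptions: the "learning algorithm" here is a hypothetical one-word left-context rule learner invented for illustration, not the algorithm used by MnM or by its IE components, and the training sentences and concept names are made up.

```python
from collections import Counter, defaultdict


def learn(marked_up):
    """Learn: map the word preceding each annotated token to its most frequent concept."""
    votes = defaultdict(Counter)
    for tokens, index, concept in marked_up:
        if index > 0:
            votes[tokens[index - 1].lower()][concept] += 1
    return {cue: counts.most_common(1)[0][0] for cue, counts in votes.items()}


def extract(rules, tokens):
    """Extract: suggest a concept for every token preceded by a learned cue word."""
    return [
        (tokens[i], rules[tokens[i - 1].lower()])
        for i in range(1, len(tokens))
        if tokens[i - 1].lower() in rules
    ]


def review(suggestions, corrections):
    """Review: a human overrides suggestions; a correction of None rejects one."""
    reviewed = []
    for fragment, concept in suggestions:
        concept = corrections.get(fragment, concept)
        if concept is not None:
            reviewed.append((fragment, concept))
    return reviewed


# Markup: (tokens, index of the annotated token, concept). Hypothetical data.
training = [
    ("Professor Smith visited Pittsburgh".split(), 1, "our:Person"),
    ("Professor Jones gave a talk".split(), 1, "our:Person"),
    ("a talk in Boston".split(), 3, "our:City"),
]

rules = learn(training)
suggestions = extract(rules, "Professor Brown spoke in Chicago and in general".split())
# The reviewer rejects the spurious suggestion for "general".
final = review(suggestions, {"general": None})

if __name__ == "__main__":
    print(rules)
    print(suggestions)
    print(final)
```

The spurious match on "general" shows why the Review step exists: learned extraction rules over-generate, and a human pass is needed to correct them.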

Amilcare and T-REX

• Amilcare:
  • automatic IE component
  • used in at least five O-based annotation tools (Melita, MnM, OntoAnnotate, Ont-O-Mat, SemantiK)
  • released to about 50 industrial and academic sites
  • Java API
• Recently succeeded by T-REX

Pankow

Input: a Web page.

• Step 1: The Web page is scanned for phrases that might be categorized as instances of the ontology (a part-of-speech tagger is used to find candidate proper nouns).
  Result 1: a set of candidate proper nouns.
• Step 2: The system iterates through all candidate proper nouns and all candidate ontology concepts to derive hypothesis phrases using preset linguistic patterns.
  Result 2: a set of hypothesis phrases.
• Step 3: Google is queried for the hypothesis phrases.
  Result 3: the number of hits for each hypothesis phrase.
• Step 4: The system sums up the query results to a total for each instance-concept pair, then categorizes each candidate proper noun into its highest-ranked concept.
  Result 4: an ontologically annotated Web page.
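Steps 2 to 4 of the Pankow procedure can be sketched in a few lines. This is an illustrative sketch, not the tool's implementation. It assumes Step 1 (part-of-speech tagging) has already produced the candidate proper nouns; the three patterns are hypothetical stand-ins for the "preset linguistic patterns"; and `hit_count` is a stub backed by invented numbers, since the search-engine interface is not described here.

```python
from collections import defaultdict

# Hypothetical patterns; pluralisation by appending "s" is deliberately naive.
PATTERNS = [
    "{instance} is a {concept}",
    "the {instance} {concept}",
    "{concept}s such as {instance}",
]


def hypothesis_phrases(candidates, concepts, patterns=PATTERNS):
    """Step 2: instantiate every pattern for every (candidate, concept) pair."""
    for instance in candidates:
        for concept in concepts:
            for pattern in patterns:
                yield instance, concept, pattern.format(instance=instance, concept=concept)


def categorize(candidates, concepts, hit_count):
    """Steps 3-4: get hit counts, sum them per pair, keep the top concept per candidate."""
    totals = defaultdict(int)
    for instance, concept, phrase in hypothesis_phrases(candidates, concepts):
        totals[(instance, concept)] += hit_count(phrase)
    annotation = {}
    for instance in candidates:
        best = max(concepts, key=lambda c: totals[(instance, c)])
        if totals[(instance, best)] > 0:  # no evidence at all: leave unannotated
            annotation[instance] = best
    return annotation


# Stand-in for the search engine: invented counts, for illustration only.
FAKE_HITS = {
    "Niger is a river": 40,
    "the Niger river": 60,
    "Niger is a town": 5,
    "Atlanta is a town": 12,
    "towns such as Atlanta": 30,
    "the Atlanta river": 1,
}


def hit_count(phrase):
    """Return the (fake) number of hits for a hypothesis phrase."""
    return FAKE_HITS.get(phrase, 0)


result = categorize(["Niger", "Atlanta"], ["river", "town"], hit_count)

if __name__ == "__main__":
    print(result)
```

Summing over several patterns rather than trusting a single one is what makes the hit counts usable as evidence: any one phrase may be rare or noisy, but the total per instance-concept pair is more stable.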

SemTag - Seeker

• IBM-developed
• ~264 million Web pages
• ~72 thousand concepts (TAP taxonomy)
• 434 million automatically disambiguated semantic tags

• Spotting pass:
  • Documents are retrieved from the Seeker store and tokenized.
  • Tokens are matched against the TAP concepts.
  • Each resulting label is saved with ten words to either side as a “window” of context around the particular candidate object.
• Learning pass:
  • A representative sample of the data is scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy.
  • The TBD (taxonomy-based disambiguation) algorithm is used.
• Tagging pass:
  • “Windows” are scanned once more to disambiguate each reference and determine a TAP object.
  • A record is entered into a database of final results containing the URL, the reference, and any other associated metadata.
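The spotting pass described above (tokenize, match against concept labels, save a ten-word window on either side) might look roughly like this. The sketch assumes single-word labels and a simple regular-expression tokenizer, both simplifications; the function name `spot` and the record layout are hypothetical, and the learning and tagging passes (including TBD) are not covered.

```python
import re


def spot(text, labels, window=10):
    """Match tokens against concept labels and keep `window` words of
    context on either side of every match."""
    tokens = re.findall(r"\w+", text)
    lowered = {label.lower() for label in labels}
    spots = []
    for i, token in enumerate(tokens):
        if token.lower() in lowered:
            spots.append({
                "label": token,
                "position": i,
                "left": tokens[max(0, i - window):i],
                "right": tokens[i + 1:i + 1 + window],
            })
    return spots


if __name__ == "__main__":
    sample = ("The jaguar is a large cat native to the Americas "
              "while the Jaguar XK is a British car")
    for record in spot(sample, {"Jaguar"}, window=3):
        print(record)
```

Both occurrences of the label are spotted with different context windows; deciding which one refers to the animal and which to the car is the job of the later disambiguation passes.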

Conclusions

• Annotation of Web documents is a necessary thing
• O-based annotation brings benefits (O-based post-processing, unified vocabularies, etc.)
• Manual annotation is a bad thing
• Automatic annotation is a good thing:
  • Supervised O-based annotation:
    • useful O-based interface for annotating the training set
    • traditional IE tools for textual classification
  • Unsupervised O-based annotation:
    • COHSE: matches concept names from the ontology and a thesaurus against tokens from the text
    • Pankow: uses the ontology to build candidate queries, then uses community wisdom to choose the best candidate
    • SemTag: uses concept names to match tokens and hierarchical relations in the ontology to disambiguate between candidate concepts for a text fragment


Questions
