Semantics-Empowered Text Exploration for Knowledge Discovery

Delroy Cameron, Pablo N. Mendes, Amit P. Sheth
Knowledge Enabled Information and Services Science Center (Kno.e.sis)
Department of Computer Science and Engineering
Wright State University
Dayton, OH

Victor Chan
Division of Biosciences and Performance
Human Effectiveness Directorate
Air Force Research Lab (AFRL)
Wright-Patterson Air Force Base
Dayton, OH

48th ACM Southeast Conference. ACMSE 2010. Oxford, Mississippi. April 15-17, 2010.

OUTLINE

• Background
• Paradigm Shift
• Demo
• Architecture
• Experimental Results
• Future Work
• Conclusion

BACKGROUND

• IR Systems - Interaction Paradigm
  - Manually seek information
  - Hyperlinked Documents
  - Document-Centric Model
• Basis - Interaction Paradigm
  - Keyword Search
  - Document Browsing

BACKGROUND

Interaction Sequence
1. Assemble Keywords and Search
2. Document Selection
3. Document Inspection
4. Aggregation/Organization

Information Need: What is the role of Magnesium in relation to Migraine?

Keyword search: Magnesium migraine

LIMITATIONS

• Query Reformulations
  - Impatient users
  - Recognition over Recall
• Constrained navigation
  - Hyperlink dependent (a priori)
• Fuzzy User Interests
  - Haiti Earthquake: Recovery, Relief, Political Climate, Crime
• Ineffective for Exploratory Search
  - Search-and-Sift
  - Query: Father of the Web
  - Answer: Sir Tim Berners-Lee

Amit P. Sheth, Cartic Ramakrishnan: Relationship Web: Blazing Semantic Trails between Web Resources. IEEE Internet Computing 11(4): 77-81 (2007)

MOTIVATION

• Users are information seekers
• Information is embedded in documents
• A priori hyperlink dependent
• Semantic Web Standards
  - Entity Identification (Semantic Annotations)
  - Relationship and Triple Identification
  - Explore documents/information via relationships

PARADIGM SHIFT

• Search Hit > Annotated Hit
  - Bag of annotated words/phrases
  - Annotated phrase is known entity
  - Entity is Subject/Object of Triple
• Navigation driven by relationships
  - Entity[Document] -> Relationship -> Entity[Document]
• Contextual Navigation (relationships as context)
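
Relationship-driven navigation can be sketched roughly as follows. This is a minimal illustration, not the Semantic Browser's implementation: the triples, predicate names, document IDs, and the `navigate` helper are all hypothetical. It assumes each annotated entity maps to the documents that mention it, and that following a triple from a subject entity leads to the object entity and its documents.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); illustrative only,
# not taken from UMLS or HPCO.
TRIPLES = [
    ("Magnesium", "associated_with", "Migraine"),
    ("Magnesium", "related_to", "Calcium Channel"),
    ("Migraine", "related_to", "Headache"),
]

# Hypothetical mapping from annotated entity to documents that mention it.
ENTITY_DOCS = {
    "Magnesium": ["doc1", "doc2"],
    "Migraine": ["doc2", "doc3"],
    "Calcium Channel": ["doc4"],
    "Headache": ["doc5"],
}


def build_index(triples):
    """Index triples by subject so outgoing relationships can be listed."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def navigate(index, entity):
    """Return (relationship, target entity, target documents) for each
    triple whose subject is the given entity."""
    return [(p, o, ENTITY_DOCS.get(o, [])) for p, o in index.get(entity, [])]


index = build_index(TRIPLES)

# Sequential record of the triples a user follows during a session.
trail = []
for predicate, target, docs in navigate(index, "Magnesium"):
    if target == "Migraine":  # the user chooses this relationship
        trail.append(("Magnesium", predicate, target))
```

Here the relationship itself, rather than a pre-existing hyperlink, decides which documents the user reaches next, and the `trail` list keeps the order in which triples were followed.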


CONTRIBUTIONS

1. Novel Information Exploration Paradigm
   - Data-Centric Model
2. Demonstrate use of background knowledge
   - Named Entities, Relationships
3. Prototype Implementation
   - Semantic annotations for navigation
4. Aggregation Utilities
   - Saving, bookmarking, publishing, etc.

DEMO


ARCHITECTURE

Figure 1: System Components and Architecture

Components and callouts recoverable from the figure:
• Semantic Browser
• Spotter Module: Trie-based Spotter for Named Entity Identification, used ultimately for document annotation
• Controlled Vocabulary: 992,281 DBpedia terms; 15,742 HPCO terms; 5,232 UMLS terms
• Background Knowledge: HPCO Ontology, UMLS
• Document Corpus: Medline (19 million abstracts); articles saved using Lucene, indexed as of Aug. 2009
• Yahoo: indexed documents accessed as a Web Service using Yahoo Search BOSS
• Linked Open Data
• Save, Publish, Organize: utilities provided for promoting, bookmarking, and saving search results
• Annotated entities provide anchors that serve as entry points to navigation
• Semantic Trail Log: sequential record of each triple navigated by a user

IMPLEMENTATION

Spotter Module

Input:
<abstract>Dietary restriction with hypomagnesia is normally associated with diminished urinary excretion.</abstract>

Spotted term: magnesium

UMLS Controlled Vocabulary (columns as labeled on the slide: Entity Label, PubMed ID)
• Magnesium Deficiency, C0024473
• Magnesium, C0024467

This process is called Spotting and uses a Trie data structure.
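
A trie-based spotter could look something like the sketch below. It is an assumption-laden illustration rather than the actual Spotter Module: the `TrieNode` and `Spotter` classes, the word-level tokenization, and the longest-match policy are all choices made here for clarity, and the input sentence is made up. Only the two vocabulary entries and their identifiers come from the slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token -> TrieNode
        self.entity_id = None  # set when a vocabulary term ends at this node


class Spotter:
    """Word-level trie over a controlled vocabulary (hypothetical design)."""

    def __init__(self, vocabulary):
        self.root = TrieNode()
        for label, entity_id in vocabulary.items():
            node = self.root
            for token in label.lower().split():
                node = node.children.setdefault(token, TrieNode())
            node.entity_id = entity_id

    def spot(self, text):
        """Return (matched phrase, entity ID) pairs, preferring the longest
        vocabulary term that starts at each position."""
        tokens = [t.strip(".,;:()").lower() for t in text.split()]
        spots, i = [], 0
        while i < len(tokens):
            node, j, best = self.root, i, None
            # Walk the trie as far as the tokens allow, remembering the
            # last position at which a complete term ended.
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.entity_id is not None:
                    best = (j, node.entity_id)
            if best:
                end, entity_id = best
                spots.append((" ".join(tokens[i:end]), entity_id))
                i = end
            else:
                i += 1
        return spots


# Vocabulary entries as listed on the slide.
vocabulary = {"Magnesium": "C0024467", "Magnesium Deficiency": "C0024473"}
spotter = Spotter(vocabulary)

# Made-up sentence for illustration.
spots = spotter.spot("Magnesium deficiency is associated with low dietary magnesium.")
print(spots)
# [('magnesium deficiency', 'C0024473'), ('magnesium', 'C0024467')]
```

Storing whole tokens on the trie edges keeps multi-word terms such as "Magnesium Deficiency" from being split into a shorter match. A spotter that must also catch variants like "hypomagnesia" would need more than exact matching.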

ARCHITECTURE

Document Corpus
• Medline Lucene Index: 19 million abstracts (Aug 2009)
• REST Endpoint: http://knoesis1.wright.edu/IndexWrapper
• XML Response (or JSON)
• Keyword queries, Document IDs

Background Knowledge
• UMLS (Unified Medical Language System): 5,232 entities and 16,540 triples
• HPCO (Human Performance & Cognition Ontology): 15,742 entities and 22,298 triples

EVALUATION

• Features ranked on a [1-5] scale
• Normalized Relative Aggregated Scores

Search User Interfaces compared: Semantic Browser (Medline + UMLS), PubMed, Yahoo

Evaluation Metric                Semantic Browser   PubMed   Yahoo
Interface Design                 0.93               0.88     1.00
Useful Features                  1.00               0.67     0.65
Motivation to Explore            1.00               0.58     0.65
Information Novelty              1.00               0.76     0.79
Effectiveness of Task outcome    1.00               0.65     0.80
Required Cognitive Load          1.00               0.60     0.64
Overall Satisfaction             1.00               0.62     0.78
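
The slides do not spell out how the normalized relative aggregated scores were computed. One plausible reading, given that every row has a top score of 1.00, is that the 1-5 ratings are summed per interface and divided by the best total for that metric. The sketch below assumes exactly that; the function name and the ratings are hypothetical and are not the study's data.

```python
def normalized_relative_scores(ratings):
    """ratings: {metric: {interface: [ratings on a 1-5 scale]}}.

    Sums the ratings per interface, then divides by the highest total
    for that metric so the best interface scores 1.00 (assumed scheme).
    """
    result = {}
    for metric, by_interface in ratings.items():
        totals = {name: sum(values) for name, values in by_interface.items()}
        best = max(totals.values())
        result[metric] = {name: round(total / best, 2) for name, total in totals.items()}
    return result


# Hypothetical ratings from three participants.
ratings = {
    "Useful Features": {
        "Semantic Browser": [5, 4, 5],
        "PubMed": [3, 3, 4],
        "Yahoo": [3, 4, 2],
    }
}
scores = normalized_relative_scores(ratings)
print(scores["Useful Features"])
# {'Semantic Browser': 1.0, 'PubMed': 0.71, 'Yahoo': 0.64}
```

Under this scheme the scores are relative: a value of 0.71 says an interface collected 71% of the best interface's total for that metric, not that it scored 71% of the maximum possible rating.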


CONCLUSION

Novel Information Exploration Paradigm

Semantic Browser supports Contextual Navigation

Identify Named Entities and Relationships

Provide Semantic Annotations

Utilities for Aggregation

Semantic Trails to Knowledge Discovery


FUTURE WORK

• Formal Model for Paradigm Shift
• Improved Spotter: Additional Vocabularies, Context, Rule Based
• Relationship Ranking
• Document Re-ranking
• Trail Logs Analysis

ACKNOWLEDGEMENTS

• People: Cartic Ramakrishnan, Bilal Gonen, Aditya Dhoke, Wesley Workman, Rodrigo Gama, Guilherme de Napoli
• Air Force Research Lab, Human Effectiveness Directorate, Wright-Patterson Air Force Base
• National Science Foundation Award, SemDis: Discovering Complex Relationships in the Semantic Web
  - No. 071441 to Wright State University
  - No. IIS-0325464 to University of Georgia

QUESTIONS


TERMINOLOGY

• Semantic Web - an extension of the current web in which data is expressed in a common vocabulary, so that the data becomes machine processable.
• Ontology - a specification of concepts and the relationships between them.
• Triple - a ternary relation containing an entity pair and a relationship that expresses the link between them, i.e. subject-predicate-object.
• Entity/Concept - an instance of a thing.
• URI - a unique identifier for any resource/entity/thing on the web.
• LOD (Linked Open Data) - a semantic web initiative to provide a repository of semantically connected datasets.