ESTER: Efficient Search on Text, Entities, and Relations

Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany

Joint work with Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR’07 in Amsterdam, July 26th

ESTER

It’s about: Fast Semantic Search


Keyword Search vs. Semantic Search

Keyword search

– Query: john lennon

– Answer: documents containing the words john and lennon

Semantic search

– Query: musician

– Answer: documents containing an instance of musician

Combined search

– Query: beatles musician

– Answer: documents containing the word beatles and an instance of musician

Useful by itself or as a component of a QA system
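The three query types can be illustrated with a minimal sketch. The toy corpus, the `is_a` mapping, and the function names are assumptions made up for this illustration; they are not part of ESTER.

```python
# Toy corpus (hypothetical): plain words plus the entities recognized in each document.
docs = {
    "d1": {"words": {"john", "lennon", "beatles", "gitanes"}, "entities": {"john_lennon"}},
    "d2": {"words": {"beatles", "album", "chart"}, "entities": set()},
    "d3": {"words": {"mozart", "vienna"}, "entities": {"wolfgang_amadeus_mozart"}},
}

# Toy ontology (hypothetical): entity -> classes it is an instance of.
is_a = {
    "john_lennon": {"musician", "singer"},
    "wolfgang_amadeus_mozart": {"musician", "composer"},
}


def keyword_search(words):
    """Documents containing all the given words."""
    return sorted(d for d, doc in docs.items() if set(words) <= doc["words"])


def semantic_search(cls):
    """Documents containing at least one instance of the class."""
    return sorted(
        d for d, doc in docs.items()
        if any(cls in is_a.get(e, set()) for e in doc["entities"])
    )


def combined_search(words, cls):
    """Documents containing the words and an instance of the class."""
    return sorted(set(keyword_search(words)) & set(semantic_search(cls)))


print(keyword_search(["john", "lennon"]))
print(semantic_search("musician"))
print(combined_search(["beatles"], "musician"))
```

This brute-force version scans every document per query; it only pins down what the answers should be, not how to compute them efficiently.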


Semantic Search: Challenges + Our System

1. Entity recognition

– approach 1: let users annotate (semantic web)

– approach 2: annotate (semi-)automatically

– our system: uses Wikipedia links + learns from them

2. Query Processing

– build a space-efficient index

– which enables fast query answers

– our system: as compact and fast as a standard full-text engine

3. User Interface

– easy to use

– yet powerful query capabilities

– our system: standard interface with interactive suggestions

Focus of the paper and of this talk: query processing (item 2 above)


In the Rest of this Talk …

Efficiency

– three simple ideas (which all fail)

– our approach (which works)

Queries supported

– essentially all SPARQL queries, and

– seamless integration with ordinary full-text search

Experiments

– efficiency (great)

– quality (not so great yet)

Conclusions

– lots of interesting + challenging open problems


Efficiency: Simple Idea 1

Add “semantic tags” to the document

– e.g., add the special word tag:musician before every occurrence of a musician in a document

Problem 1: Index blowup

– e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)

Problem 2: Limited querying capabilities

– e.g., could not produce list of musicians that occur in documents that also contain the word beatles

– in particular, could not do all SPARQL queries (more on that later)
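A small sketch of the index blowup, assuming a hypothetical tagger that inserts one tag:<class> word per class before each entity occurrence. The token list, the position map, and the five classes are made up for illustration (the slide mentions 28 classes for John Lennon).

```python
def tag_document(tokens, entity_at, classes_of):
    """Insert one tag:<class> word before every entity occurrence."""
    out = []
    for i, tok in enumerate(tokens):
        entity = entity_at.get(i)
        if entity is not None:
            # One extra indexed word per class the entity belongs to.
            out.extend(f"tag:{c}" for c in sorted(classes_of[entity]))
        out.append(tok)
    return out


tokens = ["legend", "says", "that", "lennon", "smoked", "gitanes"]
entity_at = {3: "john_lennon"}  # token index -> recognized entity (hypothetical)
classes_of = {"john_lennon": {"musician", "singer", "composer", "artist", "person"}}

tagged = tag_document(tokens, entity_at, classes_of)
print(len(tokens), "words before tagging,", len(tagged), "after")
```

Every entity occurrence costs as many extra postings as the entity has classes, which is where the blowup comes from.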


Efficiency: Simple Idea 2

Query Expansion

– e.g., replace query word musician by disjunction

musician:aaron_copland OR … OR musician:zarah_leander

(7,593 musicians in Wikipedia)

Problem: Inefficient query processing

– one intersection per element of the disjunction needed
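The cost of query expansion can be sketched as follows. The inverted lists and doc ids are invented; the point is only that the number of list intersections grows with the size of the disjunction.

```python
def intersect(a, b):
    """Intersect two sorted lists of doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def expanded_query(index, word, expansions):
    """Evaluate `word AND (e1 OR e2 OR ...)` with one intersection per expansion term."""
    base = index.get(word, [])
    hits = set()
    n_intersections = 0
    for term in expansions:
        hits.update(intersect(base, index.get(term, [])))
        n_intersections += 1
    return sorted(hits), n_intersections


# Hypothetical inverted lists (word -> sorted doc ids).
index = {
    "beatles": [1, 2, 5],
    "musician:aaron_copland": [2, 9],
    "musician:john_lennon": [1, 5, 7],
    "musician:zarah_leander": [3],
}
expansions = [w for w in sorted(index) if w.startswith("musician:")]
hits, n = expanded_query(index, "beatles", expansions)
print(hits, n)
```

With 7,593 musicians, the same loop would run 7,593 intersections for a single query word.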


Efficiency: Simple Idea 3

Use a database

– map semantic queries to SQL queries on suitably constructed tables

– that’s what the Artificial-Intelligence / Semantic-Web people usually do

Problem: Inefficient + Lack of control

– building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both

– very limited control regarding efficiency aspects


Efficiency: Our Approach

Two basic operations

– prefix search of a special kind [will be explained by example]

– join [will be explained by example]

An index data structure

– which supports these two operations efficiently

Artificial words in the documents

– such that a large class of semantic queries reduces to a combination of (few of) these operations


Processing the query “beatles musician”

Document “Gitanes”:

… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

Document “John Lennon” (the numbers are positions):

0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer …

Two prefix queries:

– beatles entity:* gives entity:john_lennon, entity:1964, entity:liverpool, etc.

– entity:* . relation:is_a . class:musician gives entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.

One join of the two completion lists:

– entity:john_lennon, etc.
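The two prefix queries and the join can be sketched on a toy corpus. This is an illustration under simplifying assumptions, not ESTER's implementation: documents and fact documents are invented, positions are ignored (the fact query is matched per document), and the vocabulary is scanned linearly instead of using the HYB index.

```python
from collections import defaultdict

# Toy corpus with artificial words (hypothetical documents).
docs = {
    "Gitanes": "legend says that john lennon entity:john_lennon of the beatles smoked gitanes",
    "Liverpool": "liverpool entity:liverpool home of the beatles",
    "John Lennon": "entity:john_lennon relation:is_a class:musician class:singer",
    "Mozart": "entity:wolfgang_amadeus_mozart relation:is_a class:musician class:composer",
    "Liverpool (fact)": "entity:liverpool relation:is_a class:city",
}

index = defaultdict(set)  # word -> ids of documents containing it
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)


def docs_with_all(words):
    """Documents containing every word in `words`."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set(docs)


def prefix_search(context_words, prefix):
    """Completions of `prefix` occurring in a document that matches the context."""
    candidates = docs_with_all(context_words)
    return {w for w, ds in index.items() if w.startswith(prefix) and ds & candidates}


def join(a, b):
    """Completions present in both lists."""
    return sorted(a & b)


co_occurring = prefix_search(["beatles"], "entity:")
musicians = prefix_search(["relation:is_a", "class:musician"], "entity:")
print(join(co_occurring, musicians))
```

The first query yields entities that co-occur with beatles, the second yields entities that are musicians, and the join keeps what appears in both.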


Processing the query “beatles musician”

Problem: entity:* has a huge number of occurrences

– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences

– prefix search efficient only for up to ≈ 1% (explanation follows)

Solution: frontier classes

– classes at “appropriate” level in the hierarchy

– e.g.: artist, believer, worker, vegetable, animal, …


Processing the query “beatles musician”

First figure out: musician → artist (easy)

Document “Gitanes”:

… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Document “John Lennon” (the numbers are positions):

0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …

Two prefix queries:

– beatles artist:* gives artist:john_lennon, artist:graham_greene, artist:pete_best, etc.

– artist:* . relation:is_a . class:musician gives artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

One join of the two completion lists:

– artist:john_lennon, etc.
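The step "first figure out: musician → artist" can be sketched as a walk up the class hierarchy until a frontier class is reached. The `parent` map and the `frontier` set below are hypothetical stand-ins; the real hierarchy and frontier set come from the ontology.

```python
# Hypothetical fragment of a class hierarchy (child -> parent).
parent = {
    "singer": "musician",
    "musician": "artist",
    "artist": "person",
    "physicist": "scientist",
    "scientist": "worker",
    "worker": "person",
}

# Hypothetical frontier set: classes at an "appropriate" level of the hierarchy.
frontier = {"artist", "believer", "worker"}


def frontier_class(cls):
    """Closest ancestor of `cls` (or `cls` itself) that is a frontier class."""
    while cls not in frontier:
        if cls not in parent:
            raise KeyError(f"no frontier class at or above {cls!r}")
        cls = parent[cls]
    return cls


def rewrite(word, cls):
    """The two prefix queries for `word cls`, using the frontier prefix."""
    f = frontier_class(cls)
    return [f"{word} {f}:*", f"{f}:* . relation:is_a . class:{cls}"]


print(rewrite("beatles", "musician"))
```

Using artist:* instead of entity:* restricts the prefix to a much smaller share of all word occurrences.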


The HYB Index [Bast/Weber, SIGIR’06]

Maintains lists for word ranges (not words)

List for the word range abl-abt:

Doc.   12    83      83      187       …
Pos.   5     14      124     88        …
Score  0.5   0.2     0.7     0.4       …
Word   able  ablaze  abroad  abnormal

Looks like this for person:*

Doc.   17                  23                  72                    72                  …
Pos.   12                  3                   55                    59                  …
Score  0.1                 0.5                 0.3                   0.5                 …
Word   person:john_lennon  person:ringo_starr  person:graham_greene  person:john_lennon


The HYB Index [Bast/Weber, SIGIR’06]

Maintains lists for word ranges (not words)

Provably efficient

– no more space than an inverted index (on the same data)

– each query = scan of a moderate number of (compressed) items

List for the word range abl-abt:

Doc.   12    83      83      187       …
Pos.   5     14      124     88        …
Score  0.5   0.2     0.7     0.4       …
Word   able  ablaze  abroad  abnormal

Extremely versatile

– can do all kinds of things an inverted index cannot do (efficiently)

– autocompletion, faceted search, query expansion, error correction, select and join, …
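A rough sketch of the idea of keeping one list per word range. The class name `BlockIndex`, the block boundaries, and the uncompressed tuples are assumptions for illustration; the actual HYB index (block sizing, compression, scoring) is described in the SIGIR’06 paper and is not reproduced here.

```python
from bisect import bisect_right


class BlockIndex:
    """Toy block index: one posting list per word range instead of per word."""

    def __init__(self, postings, boundaries):
        # boundaries[i] is the smallest word that may fall into block i;
        # the list must be sorted and start with "" so every word has a block.
        self.boundaries = boundaries
        self.blocks = [[] for _ in boundaries]
        for posting in sorted(postings):  # sorted by doc id, then position
            self.blocks[self._block_of(posting[3])].append(posting)

    def _block_of(self, word):
        return bisect_right(self.boundaries, word) - 1

    def prefix_query(self, prefix, candidate_docs=None):
        """Scan the blocks overlapping `prefix`; keep postings in candidate docs."""
        first = self._block_of(prefix)
        hits = []
        for b in range(first, len(self.blocks)):
            if b > first and not self.boundaries[b].startswith(prefix):
                break  # this block starts beyond the prefix range
            for doc, pos, score, word in self.blocks[b]:
                if word.startswith(prefix) and (candidate_docs is None or doc in candidate_docs):
                    hits.append((doc, pos, score, word))
        return hits


# Postings (doc, position, score, word) taken from the slide's example lists.
postings = [
    (12, 5, 0.5, "able"), (83, 14, 0.2, "ablaze"),
    (83, 124, 0.7, "abroad"), (187, 88, 0.4, "abnormal"),
    (17, 12, 0.1, "person:john_lennon"), (23, 3, 0.5, "person:ringo_starr"),
    (72, 55, 0.3, "person:graham_greene"), (72, 59, 0.5, "person:john_lennon"),
]
boundaries = ["", "abl", "abu", "person:", "person;"]  # hypothetical block boundaries
hyb = BlockIndex(postings, boundaries)
print(hyb.prefix_query("person:", {72, 83}))
```

A prefix query touches only the blocks whose range overlaps the prefix, so its cost is one scan over those lists rather than one list lookup per matching word.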


Queries we can handle

We prove the following theorem:

– Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations

(SPARQL stands for SPARQL Protocol And RDF Query Language; yes, it’s recursive)

Example: musicians born in the same year as John Lennon

SELECT ?who WHERE {
  ?who is_a Musician .
  ?who born_in_year ?when .
  John_Lennon born_in_year ?when
}

ESTER achieves seamless integration with full-text search

– SPARQL has no means for dealing with full-text search

– XQuery can handle full-text search, but is not really suitable for semantic search

More about supported queries in the paper
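To make the example query concrete, here is a small sketch that evaluates its three triple patterns over a toy fact list with a plain nested-loop join. This is not the reduction to prefix and join operations from the theorem; the fact list and function names are assumptions for illustration.

```python
# Toy fact list (subject, relation, object); a tiny stand-in for an ontology.
facts = [
    ("John_Lennon", "is_a", "Musician"), ("John_Lennon", "born_in_year", "1940"),
    ("Ringo_Starr", "is_a", "Musician"), ("Ringo_Starr", "born_in_year", "1940"),
    ("Paul_McCartney", "is_a", "Musician"), ("Paul_McCartney", "born_in_year", "1942"),
    ("Bruce_Lee", "is_a", "Actor"), ("Bruce_Lee", "born_in_year", "1940"),
]


def match(pattern, binding):
    """Yield extensions of `binding` under which `pattern` matches some fact."""
    for fact in facts:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, fact):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def evaluate(patterns):
    """Join the patterns one after the other (one edge of the query graph each)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings


query = [
    ("?who", "is_a", "Musician"),
    ("?who", "born_in_year", "?when"),
    ("John_Lennon", "born_in_year", "?when"),
]
who = sorted({b["?who"] for b in evaluate(query)})
print(who)
```

The query graph here has m = 3 edges, so the theorem bounds the number of prefix / join operations by 6. As written, the query also returns John Lennon himself.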


Experiments: Corpus, Ontology, Index

Corpus: English Wikipedia (XML dump from Nov. 2006)

≈ 8 GB raw XML

≈ 2.8 million documents

≈ 1 billion words

Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)

≈ 2.5 million facts

derived from clever combination of Wikipedia + WordNet (Entities from Wikipedia, Taxonomy from WordNet)

Our Index

≈ 1.5 billion words (original + artificial)

≈ 3.3 GB total index size; ontology-only is a mere 100 MB

Note: our system works for an arbitrary corpus + ontology


Experiments: Efficiency — What Baseline?

SPARQL engines

– can’t do text search

– and slow for ontology-only too (on Wikipedia: seconds)

XQuery engines

– extremely slow for text search (on Wikipedia: minutes)

– and slow for ontology-only too (on Wikipedia: seconds)

Other prototypes which do semantic + full-text search

– efficiency is hardly considered

– e.g., the system of Castells/Fernandez/Vallet (TKDE’07)

“… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”

– our system: ~100 ms, 2.8 million documents, 2.5 million facts


Experiments: Efficiency — Stress Test 1

Compare to ontology-only system

– the YAGO engine from WWW’07

– Onto Simple : when was [person] born [1000 queries]

– Onto Advanced: list all people from [profession] [1000 queries]

– Onto Hard : when did people die who were born in the same year as [person] [1000 queries]

Note: comparison very unfair (for our system)

                 Our system         Onto-Only
                 avg.     max.      avg.     max.
Onto Simple      2 ms     5 ms      3 ms     20 ms
Onto Advanced    9 ms     31 ms     3 ms     794 ms
Onto Hard        64 ms    208 ms    78 ms    550 ms
Index size       100 MB             4 GB


Experiments: Efficiency — Stress Test 2

                 Our system         Full-Text Only
                 avg.     max.      avg.     max.
Onto+Text Easy   224 ms   772 ms    90 ms    498 ms
Onto+Text Hard   279 ms   502 ms    44 ms    85 ms

Compare to text-only search engine

– state-of-the-art system from SIGIR’06

– Onto+Text Easy: counties in [US state] [50 queries]

– Onto+Text Hard: computer scientists [nationality] [50 queries]

– Full-text query: e.g., german computer scientists (note: hardly finds relevant documents)

Note: comparison extremely unfair (for our system)


Experiments: Quality — Entity Recognition

Use Wikipedia links as hints

– “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”

– “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”

Learn other links

– use words in neighborhood as features

Accuracy

all words   2 senses   3 senses   ≥4 senses
93.4%       88.2%      84.4%      80.3%
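Learning from links with neighborhood words as features can be sketched with a very simple word-count model. The training examples are adapted from the two link examples on this slide; the scoring rule (sum of per-sense word counts) is an assumption chosen for brevity and is not necessarily the learner used in the system.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count, per sense, the words seen near linked occurrences of that sense."""
    model = defaultdict(Counter)
    for context_words, sense in examples:
        model[sense].update(context_words)
    return model


def predict(model, context_words):
    """Pick the sense whose training contexts share the most words with this one."""
    def score(sense):
        return sum(model[sense][w] for w in context_words)
    return max(sorted(model), key=score)


# Contexts of linked occurrences of "Lennon" (hypothetical tokenization).
examples = [
    (["following", "paul", "mccartney", "beatles"], "John Lennon"),
    (["southern", "terminus", "south", "town"], "Lennon, Michigan"),
]
model = train(examples)
print(predict(model, ["beatles", "song"]))
print(predict(model, ["town", "road"]))
```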


Experiments: Quality — Relevance

2 Query Sets

– People associated with [american university] [100 queries]

– Counties of [american state] [50 queries]

Ground truth

– Wikipedia has corresponding lists

e.g., List of Carnegie Mellon University People

Precision and Recall

            precision@10   recall
PEOPLE      37.3%          89.7%
COUNTIES    66.5%          97.8%
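For reference, a minimal sketch of the two measures under their usual definitions. How the talk's numbers handle result lists shorter than 10 is not stated, so dividing by k is an assumption here, and the example data is invented.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant (divides by k)."""
    return sum(1 for r in ranked[:k] if r in relevant) / k


def recall(ranked, relevant):
    """Fraction of the relevant items that appear anywhere in the result list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


ranked = ["a", "b", "c", "d"]   # hypothetical result list, best first
relevant = {"a", "c", "e"}      # hypothetical ground truth
print(precision_at_k(ranked, relevant, k=2), recall(ranked, relevant))
```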


Conclusions

Semantic Retrieval System ESTER

– fast and scalable via reduction to prefix search and join

– can handle all basic SPARQL queries

– seamless integration with full-text search

– standard user interface with (semantic) suggestions

Lots of interesting and challenging problems

– simultaneous ranking of entities and documents

– proper snippet generation and highlighting

– search result quality

– …

Dank je wel!


Context-Sensitive Prefix-Search

Compute completions of last query word

– which together with the previous part of the query would lead to a hit

– [DEMO: show a live example]

Extremely useful

– autocompletion search

– faceted search

– error correction, synonym search, …

– category search

  for example, add place:amsterdam; then the query place:* finds all instances of a place

Formal definition in the paper

Isn’t the last idea enough for semantic search?


DEMO

Do the following queries [live or recorded]

– beatles

– beatles musi

– beatles musicia

– beatles musician:john_lennon (or beatles entity:john_lennon)


Processing the query “beatles musician”

Document “Liverpool” [one of many documents mentioning John Lennon]:

… in honor of the late Beatle entity:john_lennon

Document “John Lennon” (the numbers are positions):

0 entity:john_lennon 1 r:is_a 2 class:musician 2 class:singer …

Queries:

– beatles entity:*

– “entity:* r:is_a class:musician”

Problem: entity:* has a huge number of occurrences

– ≈ 200 million for Wikipedia = 20% of all occurrences

– prefix search efficient only up XXX

Solution: Frontier set

– classes high up in the hierarchy [explain more]

– e.g.: person, animal, substance, abstraction, …


Processing the query “beatles musician”

Document “Liverpool” [one of many documents mentioning John Lennon]:

… in honour of the late Beatle person:john_lennon

Document “John Lennon” (the numbers are positions):

0 person:john_lennon 1 is_a: 2 class:musician 2 class:singer …

Two prefix queries:

– beatles person:* gives person:john_lennon, person:the_queen, person:pete_best, etc.

– “person:* r:is_a class:musician” gives person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.

One join:

– entity:john_lennon etc.


Our Solution, Version 1

Combination of Prefix Search + Join

– Query 1: beatles entity:* (entities co-occurring with beatles)

– Query 2: musician – entity:* (entities which are musicians)

– Join the completions from 1 & 2 (musicians co-occurring with beatles)

Some document about Albert Einstein:

… entity:einstein …

Document “Albert Einstein”:

entity:albert_einstein scientist vegetarian intellectual …

But: unspecific prefixes (entity:*) are hard


Our Solution, Version 2

Combination of Prefix Search + Join

– Query 1: translate:singer:* (tells us that a singer is a musician)

– Query 2: beatles musician:* (musicians co-occurring with beatles)

– Query 3: singer – musician:* (musicians which are singers)

– Join the completions from 2 & 3 (singers co-occurring with beatles)

Some document mentioning John Lennon:

… musician:john_lennon xyz:john_lennon …

Document “John Lennon”:

musician:john_lennon xyz:john_lennon …

[Special Doc]

TRANSLATE:singer:musician


Processing the query “beatles musician”

Document “Gitanes”:

… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Document “John Lennon” (the numbers are positions):

0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …

Two prefix queries:

– beatles artist:* gives artist:john_lennon, artist:queen_elisabeth, artist:pete_best, etc.

– artist:* . relation:is_a . class:musician gives artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

One join of the two completion lists:

– person:john_lennon, etc.

John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty:

"Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."