wwt: a system for query-driven relation extraction from ...sunita/talks/akbc.pdf · potentials...

45
WWT: A system for query-driven relation extraction from the semi- structured web Sunita Sarawagi Rahul Gupta Girija Limaye Prashant Borole Rakesh Pimplikar Aditya Somani Soumen Chakrabarti IIT Bombay

Upload: others

Post on 10-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

WWT: A system for query-driven relation extraction from the semi-

structured web

Sunita Sarawagi

Rahul Gupta Girija Limaye Prashant Borole

Rakesh Pimplikar Aditya Somani Soumen Chakrabarti

IIT Bombay

Page 2: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

The semi-structured web

2

Table

List

Regular page

Formatted list

Page 3: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Queries in WWT

Query by example

Query by description

3

Alan Turing Turing Machine

E. F. Codd Relational Databases

Inventor Computer science concept

Page 4: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

4

Answer: Table with ranked rows

Inventor Computer Science Concept

Alan Turing Turing Machine

Seymour Cray Supercomputer

E. F. Codd Relational Databases

Tim Berners-Lee WWW

Charles Babbage Babbage Engine

Page 5: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

5

WWT Architecture

Index Query Builder

Web

Extract record sources

Query Table

Content+context index

Offline

Store

Ke

yw

ord

Qu

ery

So

urc

e L

1,…

,Lk

Type Inference

Resolver

Resolver builder

Typesystem Hierarchy

Extractor

Record labeler

CRF modelsConsolidator

Tables T1,…,Tk

Consolidated TableCell resolver Row resolver

Ranker

Row and cell scores

Final consolidated table

User

Annotate

Ontology

Page 6: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

6

Query By Example

Firefox Mitchell Gant 1982

… … …

… … …

Gran Torino Walt Kowalski 2008

Dirty Harry Harry Callahan 1971

City Heat - 1984

… - …

… - …

Joe Kidd Joe Kidd 1972

… … …

… … …

Merge & de-duplicate, Rank, Display to the user

User

Extract

Page 7: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Query-time extraction

Two differences from classical IE

1. Supervision: limited and indirect

Need robust techniques for labeling unstructured lists from Q

Our solution:

1. Similarity scores between a segment in a list and a cell in Q

2. Segment list to maximize similarity (Gupta, VLDB 2009)

2. Multiple sources with overlapping text segments

7

Page 8: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

8

Content Overlap

Segment Overlap: Partial, across varying number of sources

Page 9: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Options for exploiting overlap

Collective inference

Only helps instances with overlap

Bootstrapping or staged training

Danger of error propagation or drift.

9

Page 10: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

10

Collective Training

Page 11: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

11

Goal

Page 12: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

12

Goal

logX

yA

SY

i=1

pi(yAjwi)

Neither convex nor concave in the weightsIntractable because of exponential summations

Page 13: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

14

Matthew Matt Groening Simpsons

,The

: .1987

FOX - , 23rd

Emmy winner (creator)

“Fused” Graph: Fuse shared segments, add potentials

1

2

3

Agreement Term = Log Partition1987 Matthew “Matt” Groening : Simpsons .

FOX – Matthew “Matt” Groening , The Simpsons , 23rd

Emmy winner Matt Groening , The Simpsons (creator)

Four Cliques:• Matthew “Matt” Groening (1,2), • Matt Groening (1,2,3), • Matt Groening , The Simpsons (2,3), • Simpsons (1,2,3)

Page 14: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

15

Agreement Term = Log Partition

Page 15: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

16

Approximating the Agreement Term

Fused graph can be arbitrarily complex

Approximating the log-partition function

Belief propagation (BP)/TRW-S on the fused graph

Approximate BP tailored to chains (Liang et.al. ’09)

Two key problems

We get pseudo-marginals instead of marginalsA few wrong cliques => Completely different fused graph!

Page 16: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

17

Decomposing the Agreement Term

Idea: Fused graph = Sum of small but easier graphs

T varies over low tree-width components that span G

Page 17: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

18

1

2

3

2

3

Simpsons

Matt

Groening

,

The

Simpsons

Matthew Groening

Matt

GroeningMatt1

2

3

Clique-based Decomposition

A component = One clique + Chains incident to the clique

Each component is a tree, so exact logZ and gradient easy

1

2

Page 18: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

19

Experiments: Structured Queries

Firefox Mitchell Gant 1982

… … …

… … …

Gran Torino Walt Kowalski 2008

Dirty Harry Harry Callahan 1971

City Heat - 1984

… - …

… - …

Joe Kidd Joe Kidd 1972

… … …

… … …

Merge & de-duplicate, Rank, Display to the user

CollectiveExtractionUser

Page 19: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

20

Experiments

Aim: Reconstruct Wikipedia tables from only a few sample rows.

Sample queries

TV Series: Character name, Actor name, Season

Oil spills: Tanker, Region, Time

Golden Globe Awards: Actor, Movie, Year

Dadasaheb Phalke Awards: Person, Year

Parrots: common name, scientific name, family

Page 20: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

21

Experimental Setting

IE on 58 datasets

Each dataset = 2-20 HTML list sources from a 500M crawl

Wide range of #labels, #sources, #records, #cliques, base accuracy, noise

Handful (~3) labeled records per list source

F1 measured using manually annotated ground truth

Datasets binned by Base model F1 and Average Clique Incidence for ease of presentation

Page 21: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

22

Experiments: Other Frameworks

Label transfer cascade-prone; 10% drop for some datasets

Collective inference better, boosts 83.3% to 86.1%

Joint training best, boosts to 87.5%

Even with 7 training records, boosts F1 from 87.4% to 89.2%

Page 22: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

23

Experiments: Approximations

Decomposition methods >> approximation methods

Exact gradient computation the key reason

Clique decomposition also more tolerant to clique-noise

Page 23: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

24

Collective Training: Take homes

Collective training ideal for query-time extractions

Supervision limited

Redundancy and overlap abundant

Challenges

Arbitrary overlap pattern Intractable likelihoods

Tractable decompositions >> approximations to intractable objectives

Page 24: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Queries in WWT

Query by content

Query by description

25

Alan Turing Turing Machine

E. F. Codd Relational Databases

Inventor Computer science concept

Chemical Elements Atomic Number

Page 25: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Extraction: Description queries

Lithium 3

Sodium 11

Beryllium 4

Non-informative headers

No headers

Page 26: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Accessing relevant tables Ontological annotations

Combine clues from

Text around tables

Headers

Ontology labels when present

27

Chemical_elements

Metals Non_Metals

Alkali Gas

Aluminium

Lithium Hydrogen

All

People

Non alkali

Lithium 3

Sodium 11

Beryllium 4

Non-gas

CarbonSodium

Alkali

Chemical element

Page 27: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

28

Annotating Tables with Ontological links

Page 28: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Physicist

Person

Entity Typehierarchy

Entities

Catalog

B94 P22

The Time and Spaceof Uncle Albert

Albert Einstein

Book

Lemmas

Title Author

Uncle Albert and the Quantum Quest Russell Stannard

Relativity: The Special and the General Theory A. Einstein

B95

Uncle Albert and theQuantum Quest

Writes(Book,Person)bornAt(Person,Place)leader(Person,Country)

Type label

Relation label

B41

Relativity: The Special…

Entity label

Entity, Type, and Relation links

A Doxiadis.Uncle Petros and the Goldback conjecture

Page 29: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Challenges Ambiguity of entity names

“Hydrogen” both a chemical element and a place name

Noisy mentions of entity names

A. Einstein Vs Albert Einstein

Multiple labels

YAGO Ontology has average 2.2 types per entity

Missing type links in Ontology cannot use least common ancestor

Universities in Toronto Universities in Ontario.

Satyajit Ray Indian film directors

30

Page 30: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Collective labeling via graphical models

Variables

erc = Entity label in row r column c

tc = Type label of column c

bcc’ = Relation between columns c and c’

Potentials

Entity

Similarity between cell (r, c) in table and lemmas of entity erc in catalog

Type

Similarity between header & context of column c in table and lemmas of type tc in catalog

The specificity of a type tc: 1/|Entities in tc | 31

Á1(r; c; erc) = exp¡w>1 f1(r; c; erc)

¢:

Á2(c; tc) = exp¡w>2 f2(c; tc)

¢

Page 31: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Potentials (continued)

Entity-Type

1. erc has path to tc: (Einstein, Physicist)

Inverse distance between them

Penalizes over-generalization

2. erc has no path to tc: (Julius Plucker, German Physicist)

Fraction of erc’s siblings with tc

Julius Plucker is a Mathematician, many mathematicians are physicists

Newton is an English Physicist, overlap with German Physicists zero.

3. erc not in the catalog

Constant.

Allows NA label for columns with many unmatched entities

32

Á3(tc; erc) = exp¡w>3 f3(tc; erc)

¢

Page 32: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Potentials (continued)

Relation-Type-Type

Frequency of occurrence of this relationship in the catalog

Relation-Entity-Entity

1 if this triple exists in the catalog

-1: if the catalog refutes this triple

0: if the catalog is neutral

33

Á4(bcc0 ; tc; tc0) = exp¡w>4 f4(bcc0 ; tc; tc0)

¢

Á5(bcc0 ; erc; erc0) = exp¡w>5 f4(bcc0 ; erc; erc0)

¢

Page 33: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Inference

Exact: NP-hard, Belief propagation on factor graph 34

e31

e13

e23

e32

e22

e12

e33

t1 t3t2

e11

e21

b12 b13 b23

Φ2(1,t1)

Φ1(1,1,e11)

Φ3(t1, e31)

Φ4(b23, t2, t3)

Φ5(b23, e32, e33)

Page 34: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Accuracy of joint labeling

Dataset

Manually labeled

450 tables spanning general Web and Wikipedia

Automatically labeled

650 tables from Wikipedia where cells have entity links

35

00.10.20.30.40.50.60.70.80.9

Entity Types relations Entity

Manual Automatic

F1 A

ccu

rac

y

LCA

Majority

Ours

Page 35: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Impact on query accuracy

Movies: directed-by person=“George Clooney”

Countries: hasOfficial language=“Spanish”

Workload:

Five relations,

40 queries per relation

Ground truth: DBPedia

36

Given inputs R; T1; T2; E2 2+ T2, return all E1 2+ T1 such that

R(E1; E2) holds.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

actedIn directed language produced wrote

MA

P A

ccu

racy

Relationship Name

Baseline

Type

Type+Rels

Page 36: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

37

Summary

Amazing amount of quality information on the semi-structured web.

WWT

Online: structure interpretation at query time

Domain-independent methods for extraction and coreference resolution

Relies heavily on unsupervised statistical learning

Joint training for exploiting overlap during extraction

Graphical model for table annotation

Collective column labeling for descriptive queries

Bayesian network for consolidation

Page rank + confidence from a probabilistic extractor for ranking

Page 37: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Thank you.

38

Page 38: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Summary

Enriched organically created web tables with precious schema information from reference catalogs

Joint graphical model combines

Local text similarity

Global type and relationship consistency

Better response to attribute-value queries

Ongoing work:

Augmenting catalog with collective information from web tables.

39

Page 39: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

What next?

Designing plans for non-trivial ways of combining of sources

Better ranking and user-interaction models.

Expanding query set

Aggregate queries: tables are rich in quantities

Point queries: attribute value and relationship queries

Interplay between semi-structured web & Ontologies

Augmenting one with the other.

Quantify information in structured sources vis-à-vis text sources on typical query workloads

40

Page 40: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

41

Collective Training: Related Work Agreement based learning (Liang et.al. ’09)

EM-based scheme applied on two sources with clean cliques

Posterior Regularization (Ganchev et.al. ’08)

Different agreement term; Used in multi-view

Two-view {perceptron, regression}, co-{training, boosting, SVMs} (Brefeld et.al. ’05, Blum & Mitchell ’98, Collins & Singer ’99, Sindhwani et.al. ’05, Kakade & Foster ’07)

Two source and/or hard label transfer

Multi-task learning (Ando & Zhang ’05)

Single source, shared features sought

Semi-supervised learning (Chapelle et.al. ’06)

No training, no support for partially structured overlaps

Co-regularization, Pooling (Suzuki et.al. ’07)

Page 41: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

42

Annotating to an Ontology

Annotate table cells with entity nodes and table columns with type nodes

movies

Indian_films English_films

2008_films Terrorism_films

A_Wednesday

Black&White

Coffee_house (film)

Wednesday

All

People

Entertainers

Coffee_house (Loc)

Indian_films

2008_films

Indian_directors

Coffee_house (film)

Black&White

A_Wednesday

Page 42: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

44

Overall performance

Justify sophisticated consolidation and resolution. So compare with: Processing only the magically known single best list

=> no consolidation/resolution required. Simple consolidation. No merging of approximate duplicates.

WWT has > 55% recall, beats others. Gain bigger for difficult queries.

All Queries Difficult Queries

Page 43: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

45

Running time

< 30 seconds with 3 query records.

Can be improved by processing sources in parallel.

Variance high because time depends on number of columns, record length etc.

Page 44: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

46

Re-writing the Agreement Term

Labeling of chain i

Page 45: WWT: A system for query-driven relation extraction from ...sunita/talks/akbc.pdf · Potentials (continued) Entity-Type 1. e rc has path to t c: (Einstein, Physicist) Inverse distance

Physicist

Person

Entity Typehierarchy

Entities

Catalog

B94 P22

The Time and Spaceof Uncle Albert

Albert Einstein

Book

Lemmas

Title Author

Uncle Albert and the Quantum Quest Russell Stannard

Relativity: The Special and the General Theory A. Einstein

B95

Uncle Albert and theQuantum Quest

Writes(Book,Person)bornAt(Person,Place)leader(Person,Country)

Type label

Relation label

B41

Relativity: The Special…

Entity label

A Doxiadis.Uncle Petros and the Goldbach conjecture