Structured Querying of Web Text: A Technical Challenge

Page 1: Structured Querying of Web Text A Technical Challenge

Structured Querying of Web Text: A Technical Challenge

Kulsawasd Jitkajornwanich

University of Texas at Arlington

[email protected]

CSE6339 Web Mining | April 16, 2009 | 9:30 am

by Cafarella, Ré, Suciu, Etzioni & Banko

Page 2: Structured Querying of Web Text A Technical Challenge

Introduction

What is a structured query? There are 2 types of query: structured query and unstructured query.

1. Structured query

Has a "condition" in the query

Can express a complicated query

ex. an SQL query: list the employees whose name starts with 'David' and whose salary is greater than 5000

SELECT E.name, E.salary FROM Employee E WHERE E.name LIKE 'David%' AND E.salary > 5000

Page 3: Structured Querying of Web Text A Technical Challenge

Introduction

What is a structured query?

2. Unstructured query

ex. keyword search

Has no "condition" in the query

Simply performs string matching
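The contrast between the two query types can be sketched in a few lines of Python. This is only an illustration: the `Employee` rows are invented, the in-memory SQLite database stands in for any relational system, and the keyword search is reduced to a plain substring test.

```python
import sqlite3

# Hypothetical Employee rows, invented for illustration only.
rows = [
    ("David Smith", 6200),
    ("David Jones", 4100),
    ("Maria David", 7000),
    ("Alice Brown", 8000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", rows)

# Structured query: two conditions combined with AND.
structured = conn.execute(
    "SELECT E.name, E.salary FROM Employee E "
    "WHERE E.name LIKE 'David%' AND E.salary > 5000"
).fetchall()

# Unstructured query: plain string matching on a keyword, no conditions.
keyword = "David"
unstructured = [name for name, _ in rows if keyword in name]

print(structured)    # [('David Smith', 6200)]
print(unstructured)  # ['David Smith', 'David Jones', 'Maria David']
```

The keyword match cannot express "name starts with" or "salary above a threshold", which is the gap a structured query closes.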

Page 4: Structured Querying of Web Text A Technical Challenge

Introduction

We have just discussed the types of query. What about the types of data? There are 2 types of data:

1. Structured data, ex. relational tables

2. Unstructured data, ex. web documents

Page 5: Structured Querying of Web Text A Technical Challenge

Introduction

Objective of the paper: to propose a tool called ExDB that makes structured queries on web documents (unstructured data).

[Figure: query type versus data type. A relational database holds structured data and is queried with SQL (structured query). A search engine handles web text (unstructured data) with keyword search (unstructured query). ExDB brings structured queries, as complicated as SQL queries, to web text.]

Page 6: Structured Querying of Web Text A Technical Challenge

How it works: Big Picture of ExDB

[Figure: a collection of web documents feeds the ExDB Extractor, which fills the Fact Table, Type Table, and Constraint Table in an RDBMS database. The user sends a query such as q(?x,?y):- invented(?x,?y) to the ExDB Compiler, which returns the resulting table.]


Page 8: Structured Querying of Web Text A Technical Challenge

Outline

1st Component: ExDB Extractor. What does it do, and how, in more detail?

2nd Component: ExDB Compiler. What does it do, and how, in more detail?

Test your understanding!

Working on tasks

Comparing results: ExDB & Google

Conclusion

Page 9: Structured Querying of Web Text A Technical Challenge

How ExDB Works

1st Component: ExDB Extractor

What does it do? It extracts data from the web documents and puts it into tables.

Page 10: Structured Querying of Web Text A Technical Challenge

How ExDB Works

2nd Component: ExDB Compiler

What does it do? It processes the user's structured query on the tables from the 1st component (ExDB Extractor) and gives the resulting table back to the user.

ex. q(?x, ?y):- invented(?x, ?y) (we will study this query syntax later on)

Page 11: Structured Querying of Web Text A Technical Challenge

How it works: Big Picture of ExDB

[Figure: the ExDB pipeline, annotated with its two components. 1st Component, ExDB Extractor: reads the collection of web documents (ex. "…was surprising. In 1877, Edison invented the light bulb. Although he…") and fills the Fact Table, Type Table, and Constraint Table in the RDBMS database. 2nd Component, ExDB Compiler: the user makes a query using ExDB syntax, and the compiler evaluates it on those tables.]

Page 12: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

What does it do? It extracts data from the web documents and puts it into tables. There are 3 tables:

1. Fact Table
2. Type Table
3. Constraint Table

Each table has an additional column that stores the tuple probability.

Discussion: Why do we need this column?

$0 < p < 1$, $\sum_i p_i = 1$

One way to assign the probability: count the occurrence frequency.

Assume independence among tuples.

Page 13: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

1.1 Fact Table

Stores fact information, ex. "Edison invented the light bulb".

Uses TextRunner to extract the facts.

What does it look like?

Fact Table
Predicate | Object 1 | Object 2 | Probability
invented | Edison | Light bulb | 0.75
died-in | Edison | 1877 | 0.55
… | … | … | …

$\text{Probability} = \dfrac{\text{no. of occurrences of the tuple}}{\text{no. of occurrences of the predicate}}$

Page 14: Structured Querying of Web Text A Technical Challenge

Example 1: how the Fact Table is built

TextRunner extracts the objects and the predicate from each sentence in the web documents:

"…was surprising. In 1877, Edison invented the light bulb. Although he…"
"It was big news when Edison invented the light bulb."
"…We all know that Edison invented the light bulb."
"…not only that, Edison also invented the phonograph."

Fact Table
Predicate | Object 1 | Object 2 | Probability
invented | Edison | Light bulb | 0.75
invented | Edison | Phonograph | 0.25
… | … | … | …

$\text{Probability} = \dfrac{\text{no. of occurrences of the tuple}}{\text{no. of occurrences of the predicate}}$

The predicate "invented" occurs 4 times: 3 times with the light bulb (3/4 = 0.75) and once with the phonograph (1/4 = 0.25).
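The counting rule from Example 1 can be sketched as follows. The list of triples is a hand-written stand-in for what an extractor such as TextRunner might emit; the function name `build_fact_table` and the triple layout are assumptions made for this illustration, not part of ExDB.

```python
from collections import Counter

# Hand-written stand-in for extractor output: (object 1, predicate, object 2),
# one triple per sentence in Example 1.
extracted = [
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "light bulb"),
    ("Edison", "invented", "phonograph"),
]


def build_fact_table(triples):
    """Probability = occurrences of the tuple / occurrences of its predicate."""
    tuple_counts = Counter(triples)
    predicate_counts = Counter(pred for _, pred, _ in triples)
    return {
        (pred, obj1, obj2): count / predicate_counts[pred]
        for (obj1, pred, obj2), count in tuple_counts.items()
    }


fact_table = build_fact_table(extracted)
for row, prob in fact_table.items():
    print(row, prob)
# ('invented', 'Edison', 'light bulb') 0.75
# ('invented', 'Edison', 'phonograph') 0.25
```

With this rule the probabilities of all tuples that share a predicate sum to 1.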

Page 15: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

Discussion: What do you think might be a problem with this design of the Fact Table?

It cannot support ternary predicates, ex. "David donates books to Child Organization."

Fact Table
Predicate | Object 1 | Object 2 | Probability
invented | Edison | Light bulb | 0.75
died-in | Edison | 1877 | 0.55
… | … | … | …

Page 16: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

1.2 Type Table

Stores object type information, ex. "Edison is a scientist."

Uses KnowItAll to extract the types.

What does it look like?

Type Table
Type | Object | Probability
Scientist | Edison | 0.73
City | Boston | 0.36
… | … | …

$\text{Probability} = \dfrac{\text{no. of occurrences of the tuple}}{\text{no. of occurrences of the type}}$

Page 17: Structured Querying of Web Text A Technical Challenge

Example 2: how the Type Table is built

KnowItAll extracts the object and its type from each sentence in the web documents:

"…As we know, Edison is a scientist. Although he…"
"…there are many world-famous scientists such as Edison,…"
"…However, someone claims that Benjamin is also a scientist."
"…scientists such as Edison,…"

Type Table
Type | Object | Probability
scientist | Edison | 0.75
scientist | Benjamin | 0.25
… | … | …

$\text{Probability} = \dfrac{\text{no. of occurrences of the tuple}}{\text{no. of occurrences of the type}}$

The type "scientist" occurs 4 times: 3 times with Edison (3/4 = 0.75) and once with Benjamin (1/4 = 0.25).

Page 18: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

1.3 Constraint Table

Stores constraint information about objects or predicates.

There are 2 types of constraints discussed in this paper: Synonym and Inclusion Dependency.

Uses DIRT to extract the constraints.

1. Synonym

example for predicate: did-invented = invented

example for object: Edison T. = Edison

2. Inclusion Dependency

example for predicate: be-guardian, be-parent

example for object: relative, sister

Page 19: Structured Querying of Web Text A Technical Challenge

Example: how DIRT works for the Synonym constraint

[Figure: from the collection of web documents (ex. "…was surprising. In 1877, Edison invented the light bulb. Although he…"), DIRT groups the names "Thomas E.", "Edison T.", and "Thomas Edison" as synonyms.]

Page 20: Structured Querying of Web Text A Technical Challenge

Example: how DIRT works for the Inclusion Dependency constraint

[Figure: from the collection of web documents (ex. "…was surprising. In 1877, Edison invented the light bulb. Although he…"), DIRT relates the predicates Be-parent, Be-guardian, and Be-babysitter.]

Page 21: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

1.3 Constraint Table

What does it look like?

Constraints Table
Constraint | Object 1 | Object 2 | Probability
Synonym | Edison | T. Edison | 0.75
Inclusion Dependency | Be-parent | Be-guardian | 0.55
… | … | … | …

(Annotations on the Inclusion Dependency row: Superset, Subset)

Page 22: Structured Querying of Web Text A Technical Challenge

1st Component: ExDB Extractor

Key point summary of the 1st component (ExDB Extractor):

1. The ExDB Extractor uses several existing extractors: TextRunner, KnowItAll, and DIRT.

2. A probability column is used to indicate the degree of correctness and to deal with the uncertainty problem.

3. Drawback of the Fact Table: only binary predicates are allowed.


Page 24: Structured Querying of Web Text A Technical Challenge

2nd Component: ExDB Compiler

What does it do? It processes the user's structured query on the tables from the 1st component (ExDB Extractor).

The result is in table format, ranked by probability, highest first.

ex. q(?x, ?y):- invented(?x, ?y)

However, users are not expected to know the table schema.

Page 25: Structured Querying of Web Text A Technical Challenge

2nd Component: ExDB Compiler

ExDB syntax:

?x = variable x

w = constant value w

q(?x,?y):- = defines the resulting table q, consisting of columns x and y

invented(?x,?y) = returns the list of objects x and y related by the predicate "invented"

invented(<scientists> ?x,?y) = returns the list of objects x, whose type is <scientists>, and y related by the predicate "invented"

This syntax is called "Datalog-like notation".

Let's try some examples!

q(?x, ?y):- invented(?x, ?y)
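To make the notation concrete, here is a minimal sketch of a parser for it. It is not the ExDB Compiler: the regular expressions, the function names, and the dictionary layout are assumptions chosen for this illustration, and comparison conditions such as (?z > 1900) are deliberately ignored.

```python
import re

# predicate(arg, arg): predicate names may contain hyphens, e.g. died-in
ATOM = re.compile(r"([A-Za-z][\w-]*)\(([^()]*)\)")
# optional <type>, optional ? marking a variable, then the name or constant
ARG = re.compile(r"(?:<([\w-]+)>\s*)?(\?)?([\w .-]+)")


def parse_arg(text):
    match = ARG.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"cannot parse argument: {text!r}")
    type_name, var_mark, name = match.groups()
    return {"type": type_name, "var": var_mark is not None, "name": name.strip()}


def parse_atom(match):
    predicate, arg_text = match.groups()
    return {
        "predicate": predicate,
        "args": [parse_arg(a) for a in arg_text.split(",")],
    }


def parse_query(query):
    """Split 'head :- body' and parse every predicate atom in the body."""
    head_text, body_text = query.split(":-", 1)
    head = ATOM.fullmatch(head_text.strip())
    if head is None:
        raise ValueError(f"cannot parse head: {head_text!r}")
    return {
        "head": [arg["name"] for arg in parse_atom(head)["args"]],
        "body": [parse_atom(m) for m in ATOM.finditer(body_text)],
    }


parsed = parse_query("q(?x, ?y):- invented(<scientists> ?x, ?y)")
print(parsed["head"])             # ['x', 'y']
print(parsed["body"][0]["args"])  # typed variable x, untyped variable y
```

The head lists the output columns, and each body atom names a predicate to look up in the Fact Table, with an optional type restriction per argument.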

Page 26: Structured Querying of Web Text A Technical Challenge

Make a Query: example

Example 4: list all inventions invented by Edison

Fact Table
Predicate | Object 1 | Object 2 | Probability
invented | Edison | Light bulb | 0.55
invented | Edison | Telescope | 0.14
invented | Edison | Phonograph | 0.14
invented | Jason | Cell phone | 0.14
died-in | Mary T. | 1877 | 0.05

Answer: q(?i):- invented(Edison, ?i)

q Table
i | Probability
Light bulb | 0.55
Telescope | 0.14
Phonograph | 0.14
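Example 4 amounts to a selection on the Fact Table followed by a ranking. A minimal sketch, assuming the table is held as a list of tuples and that `None` stands for a variable (both choices, and the function name `answer`, are made up for this illustration):

```python
# Fact Table from Example 4: (predicate, object 1, object 2, probability)
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.55),
    ("invented", "Edison", "Telescope", 0.14),
    ("invented", "Edison", "Phonograph", 0.14),
    ("invented", "Jason", "Cell phone", 0.14),
    ("died-in", "Mary T.", "1877", 0.05),
]


def answer(facts, predicate, obj1=None, obj2=None):
    """Return matching (object 1, object 2, probability) rows, best first.

    An argument left as None plays the role of a variable such as ?i.
    """
    rows = [
        (o1, o2, prob)
        for pred, o1, o2, prob in facts
        if pred == predicate and obj1 in (None, o1) and obj2 in (None, o2)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# q(?i):- invented(Edison, ?i)
q_table = [(o2, prob) for _, o2, prob in answer(FACT_TABLE, "invented", obj1="Edison")]
print(q_table)
# [('Light bulb', 0.55), ('Telescope', 0.14), ('Phonograph', 0.14)]
```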


Page 28: Structured Querying of Web Text A Technical Challenge

Make a Query: example

Example 5: list all scientists who died in 1955

Fact Table
Predicate | Object 1 | Object 2 | Prob
invented | Edison | Light bulb | 0.70
died-in | Edison | 1955 | 0.40
invented | David A. | Guitar | 0.30
died-in | Peter | 1955 | 0.20
died-in | Mary T. | 1800 | 0.05

Type Table
Type | Object | Prob
scientist | Edison | 0.50
scientist | Peter | 0.15
scientist | Mary T. | 0.15
scientist | David A. | 0.10
city | Boston | 0.05

Answer: q(?i):- died-in(<scientist> ?i, 1955)

Joining Table
Type | Object | Predicate | Object | Prob
scientist | Edison | died-in | 1955 | 0.20
scientist | Peter | died-in | 1955 | 0.03

0.20 = 0.50 x 0.40 because we assume independence among tuples, i.e. $P(t_1, t_2) = P(t_1) \cdot P(t_2)$.

q Table
?i | Prob
Edison | 0.20
Peter | 0.03
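The join in Example 5 can be sketched as a nested loop over the two tables, multiplying the probabilities under the independence assumption. The list-of-tuples layout and the function name `typed_lookup` are assumptions made for this illustration; results are rounded to two decimals only to match the slide.

```python
# Tables from Example 5.
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.70),
    ("died-in", "Edison", "1955", 0.40),
    ("invented", "David A.", "Guitar", 0.30),
    ("died-in", "Peter", "1955", 0.20),
    ("died-in", "Mary T.", "1800", 0.05),
]
TYPE_TABLE = [
    ("scientist", "Edison", 0.50),
    ("scientist", "Peter", 0.15),
    ("scientist", "Mary T.", 0.15),
    ("scientist", "David A.", 0.10),
    ("city", "Boston", 0.05),
]


def typed_lookup(predicate, type_name, obj2):
    """Evaluate predicate(<type_name> ?i, obj2) by joining the Type and Fact tables."""
    results = []
    for t, obj, p_type in TYPE_TABLE:
        if t != type_name:
            continue
        for pred, o1, o2, p_fact in FACT_TABLE:
            if pred == predicate and o1 == obj and o2 == obj2:
                # Independence among tuples: P(t1, t2) = P(t1) * P(t2)
                results.append((obj, round(p_type * p_fact, 2)))
    return sorted(results, key=lambda row: row[1], reverse=True)


# q(?i):- died-in(<scientist> ?i, 1955)
q_table = typed_lookup("died-in", "scientist", "1955")
print(q_table)  # [('Edison', 0.2), ('Peter', 0.03)]
```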

Page 29: Structured Querying of Web Text A Technical Challenge

Make a Query: example

Example 6: list all scientists who died after 1900, their inventions, and the year they died

Fact Table
Predicate | Object 1 | Object 2 | Prob
invented | Edison | Light bulb | 0.70
died-in | Edison | 1955 | 0.40
invented | David A. | Guitar | 0.30
died-in | Peter | 1955 | 0.20
died-in | Mary T. | 1800 | 0.05

Type Table
Type | Object | Prob
scientist | Edison | 0.50
scientist | Peter | 0.15
scientist | Mary T. | 0.15
scientist | David A. | 0.10
city | Boston | 0.05

Answer: q(?x, ?y, ?z):- invented(?x, ?y), died-in(<scientist> ?x, ?z), (?z > 1900)

Page 30: Structured Querying of Web Text A Technical Challenge

Make a Query: example

Example 6: list all scientists who died after 1900, their inventions, and the year they died

Fact Table
Predicate | Object 1 | Object 2 | Prob
invented | Edison | Light bulb | 0.70
died-in | Edison | 1955 | 0.40
invented | David A. | Guitar | 0.30
died-in | Peter | 1955 | 0.20
died-in | Mary T. | 1800 | 0.05

Type Table
Type | Object | Prob
scientist | Edison | 0.50
scientist | Peter | 0.15
scientist | Mary T. | 0.15
scientist | David A. | 0.10
city | Boston | 0.05

Joining Table
Type | Predicate | Predicate | Object | Object | Object | Prob
scientist | died-in | invented | Edison | 1955 | Light bulb | 0.14

0.14 = 0.50 x 0.40 x 0.70

q Table
?x | ?y | ?z | Prob
Edison | Light bulb | 1955 | 0.14
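Example 6 adds a second predicate and a comparison to the join. The following sketch evaluates it with plain nested loops; the list-of-tuples layout, the integer years, and the variable names are assumptions for this illustration, and a real system would hand the join to the RDBMS instead.

```python
# Tables from Example 6; years are stored as integers so they can be compared.
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.70),
    ("died-in", "Edison", 1955, 0.40),
    ("invented", "David A.", "Guitar", 0.30),
    ("died-in", "Peter", 1955, 0.20),
    ("died-in", "Mary T.", 1800, 0.05),
]
TYPE_TABLE = [
    ("scientist", "Edison", 0.50),
    ("scientist", "Peter", 0.15),
    ("scientist", "Mary T.", 0.15),
    ("scientist", "David A.", 0.10),
    ("city", "Boston", 0.05),
]

# q(?x, ?y, ?z):- invented(?x, ?y), died-in(<scientist> ?x, ?z), (?z > 1900)
q_table = []
for pred_inv, x, y, p_inv in FACT_TABLE:
    if pred_inv != "invented":
        continue
    for pred_died, x_died, z, p_died in FACT_TABLE:
        # Join on ?x and apply the selection ?z > 1900.
        if pred_died != "died-in" or x_died != x or z <= 1900:
            continue
        for type_name, obj, p_type in TYPE_TABLE:
            if type_name == "scientist" and obj == x:
                # Product of the three tuple probabilities (independence).
                q_table.append((x, y, z, round(p_type * p_died * p_inv, 2)))

print(q_table)  # [('Edison', 'Light bulb', 1955, 0.14)]
```

David A. is a scientist with an invention but has no died-in fact, and Peter has a died-in fact but no invention, so only Edison survives the join.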

Page 31: Structured Querying of Web Text A Technical Challenge

Test Your Understanding!

Problem 1: list all singers who were born in 1980, and their instruments

Fact Table
Predicate | Object 1 | Object 2 | Prob
invented | Edison | Light bulb | 0.70
play | John | Guitar | 0.40
invented | David A. | Guitar | 0.30
play | Jackson | Piano | 0.20
play | Jackson | Guitar | 0.05
born-in | John | 1990 | 0.05
born-in | Jackson | 1980 | 0.05
born-in | Bobby | 1980 | 0.05

Type Table
Type | Object | Prob
singer | John | 0.50
instrument | Guitar | 0.15
instrument | Piano | 0.15
singer | Jackson | 0.10
singer | Bobby | 0.05

Answer: q(?x, ?y):- play(<singer> ?x, <instrument> ?y), born-in(<singer> ?x, 1980)

Page 32: Structured Querying of Web Text A Technical Challenge

Test Your Understanding!

Problem 2: list all singers whose income is higher than their producer's

Fact Table
Predicate | Object 1 | Object 2 | Prob
being-producer | Matt | Bobby | 0.70
being-producer | Matt | Jackson | 0.40
has-income | Bobby | 2500 | 0.30
has-income | Jackson | 3000 | 0.20
has-income | Matt | 2000 | 0.05
being-producer | Matt | John | 0.05
has-income | John | 1000 | 0.05

Type Table
Type | Object | Prob
singer | John | 0.50
producer | Matt | 0.15
producer | David | 0.15
singer | Jackson | 0.10
singer | Bobby | 0.05

Answer: q(?x):- has-income(<singer> ?x, ?y), has-income(<producer> ?m, ?n), being-producer(?m, ?x), (?y > ?n)

Page 33: Structured Querying of Web Text A Technical Challenge

Make a Query: example

Example 7: list all inventions discovered by Edison

Fact Table
Predicate | Obj.1 | Obj.2 | Prob
invented | Edison | Light bulb | 0.55
invented | Edison | Telescope | 0.14
invented | Edison | Phonograph | 0.14
invented | Jason | Cell phone | 0.14
died-in | Mary | 1877 | 0.05

Answer: q(?i):- discovered(Edison, ?i)

Discussion: The Fact Table has no "discovered" predicate. In this case, what can we do to answer this query?

Constraint Table (ID = Inclusion Dependency, Syn = Synonym)
Const | Obj.1 | Obj.2 | Prob
ID | invented | discovered | 0.50
Syn | Edison | Edison T | 0.15
Syn | Edison | Thomas E | 0.10

q Table
i | Probability
Light bulb | 0.55 x 0.5
Telescope | 0.14 x 0.5
Phonograph | 0.14 x 0.5
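One way to read Example 7 is as a query rewrite: when the requested predicate is missing from the Fact Table, look for an Inclusion Dependency row that relates it to a stored predicate, and scale each answer by the probability of that constraint. The sketch below follows that reading. Which column of an ID row holds the stored predicate is an assumption here, as are the function name and the rounding.

```python
# Tables from Example 7.
FACT_TABLE = [
    ("invented", "Edison", "Light bulb", 0.55),
    ("invented", "Edison", "Telescope", 0.14),
    ("invented", "Edison", "Phonograph", 0.14),
    ("invented", "Jason", "Cell phone", 0.14),
    ("died-in", "Mary", "1877", 0.05),
]
CONSTRAINT_TABLE = [
    ("ID", "invented", "discovered", 0.50),   # inclusion dependency
    ("Syn", "Edison", "Edison T", 0.15),      # synonym
    ("Syn", "Edison", "Thomas E", 0.10),      # synonym
]


def answer_with_constraints(predicate, obj1):
    """Answer predicate(obj1, ?i), falling back on inclusion dependencies."""
    # The predicate itself has weight 1; each related predicate is weighted
    # by the probability of the constraint that links it (assumed direction).
    candidates = [(predicate, 1.0)]
    for kind, stored, requested, prob in CONSTRAINT_TABLE:
        if kind == "ID" and requested == predicate:
            candidates.append((stored, prob))

    results = []
    for pred, weight in candidates:
        for fact_pred, o1, o2, prob in FACT_TABLE:
            if fact_pred == pred and o1 == obj1:
                results.append((o2, round(prob * weight, 3)))
    return sorted(results, key=lambda row: row[1], reverse=True)


# q(?i):- discovered(Edison, ?i)
q_table = answer_with_constraints("discovered", "Edison")
print(q_table)
# [('Light bulb', 0.275), ('Telescope', 0.07), ('Phonograph', 0.07)]
```

The values 0.275 and 0.07 are the products 0.55 x 0.5 and 0.14 x 0.5 shown in the q Table.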

Page 34: Structured Querying of Web Text A Technical Challenge

Make a Query: Problem Scenario

Example 8 (this example involves PROJECTION): list all names that invented something

Joining Table
?x | ?y | Prob
Edison | Light bulb | 0.34
Edison | Telescope | 0.13
Edison | Phonograph | 0.13
Tree | Table | 0.09
Tree | Pen | 0.09
Tree | Paper | 0.09
Tree | Fruit | 0.09
Tree | Forest | 0.09
Tree | Eraser | 0.09
Tree | Ruler | 0.09

Answer: q(?x):- invented(?x, ?y)

q Table
?x | Prob
Tree | 0.63
Edison | 0.60

0.63 = 0.09 x 7, and 0.60 = 0.34 + 0.13 + 0.13

Discussion: Can you see something wrong in the resulting table?

Page 35: Structured Querying of Web Text A Technical Challenge

Solving the Problem Scenario with the "Panel of Experts" technique

The problem scenario is caused by the projection operation.

Conventional way: $\text{newProb} = \sum_i \text{duplicateProb}_i$

New way: use the "Panel of Experts" technique. Principle:

1. Define a number n of duplicate outputs to keep. For example, with n = 5, if there are 10 duplicate outputs in total, we consider only 5 and eliminate the other 5, which removes low-quality output.

2. Calculate newProb by selecting the maximum value among those n duplicate outputs:

$\text{newProb} = \max_{i \le n} \{\text{duplicateProb}_i\}$

Page 36: Structured Querying of Web Text A Technical Challenge

Make a Query: Problem Scenario

Example 8 (problem caused by the projection operation): list all names that invented something

Joining Table
?x | ?y | Prob
Edison | Light bulb | 0.34
Edison | Telescope | 0.13
Edison | Phonograph | 0.13
Tree | WrongInfo1 | 0.09
Tree | WrongInfo2 | 0.09
Tree | WrongInfo3 | 0.09
Tree | WrongInfo4 | 0.09
Tree | WrongInfo5 | 0.09
Tree | WrongInfo6 | 0.09
Tree | WrongInfo7 | 0.09

Answer: q(?x):- invented(?x, ?y)

q Table (conventional way)
?x | Prob
Tree | 0.63
Edison | 0.60

0.63 = 0.09 x 7

q Table (solved by the "Panel of Experts" technique)
?x | Prob
Edison | 0.34
Tree | 0.09
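The two projection rules can be compared side by side on the Example 8 data. This sketch follows the description given on these slides (keep at most n duplicates, then take the maximum); it is not taken from the ExDB implementation, and the function names and the default n = 5 are assumptions.

```python
from collections import defaultdict

# Joining Table from Example 8: (?x, ?y, probability)
JOINED = [
    ("Edison", "Light bulb", 0.34),
    ("Edison", "Telescope", 0.13),
    ("Edison", "Phonograph", 0.13),
] + [("Tree", f"WrongInfo{i}", 0.09) for i in range(1, 8)]


def group_by_x(rows):
    groups = defaultdict(list)
    for x, _, prob in rows:
        groups[x].append(prob)
    return groups


def project_sum(rows):
    """Conventional projection: add up the probabilities of the duplicates."""
    result = [(x, round(sum(probs), 2)) for x, probs in group_by_x(rows).items()]
    return sorted(result, key=lambda row: row[1], reverse=True)


def project_panel_of_experts(rows, n=5):
    """Keep the n most probable duplicates per value, then take their maximum."""
    result = [
        (x, max(sorted(probs, reverse=True)[:n]))
        for x, probs in group_by_x(rows).items()
    ]
    return sorted(result, key=lambda row: row[1], reverse=True)


print(project_sum(JOINED))               # [('Tree', 0.63), ('Edison', 0.6)]
print(project_panel_of_experts(JOINED))  # [('Edison', 0.34), ('Tree', 0.09)]
```

Summing rewards "Tree" for appearing in many low-confidence tuples, while the maximum ranks each name by its single best-supported tuple.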

Page 37: Structured Querying of Web Text A Technical Challenge

2nd Component: ExDB Compiler

Key points summary of the 2nd component (ExDB Compiler):

1. ExDB has its own query syntax.

2. The result is in table format.

3. The last column is the probability value, and rows are ranked in decreasing order of probability. The assumption is that the higher the probability, the more accurate the answer.

4. Top-K evaluation can be implemented to reduce time complexity (increase performance).

5. For a JOIN, the resulting probability is the product of the probabilities of the joined tuples.

6. For a PROJECTION, use the Panel of Experts technique to solve the duplicate problem.

7. If the user's query contains a relation that does not exist in the Fact Table, the Constraint Table can be used to answer the query.
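The top-K idea in point 4 can be illustrated with the standard library: instead of sorting the whole result, keep only the K most probable rows. This sketch only shows the ranking step on an already computed result. The slides do not say how ExDB implements top-K, so the function name and data layout are assumptions.

```python
import heapq

# An already computed result table: (answer, probability)
RESULT = [
    ("Telescope", 0.14),
    ("Light bulb", 0.55),
    ("Cell phone", 0.10),
    ("Phonograph", 0.25),
]


def top_k(rows, k):
    """Return the k rows with the highest probability, best first."""
    return heapq.nlargest(k, rows, key=lambda row: row[-1])


print(top_k(RESULT, 2))  # [('Light bulb', 0.55), ('Phonograph', 0.25)]
```

For a small K, `heapq.nlargest` avoids a full sort of the result.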

Page 38: Structured Querying of Web Text A Technical Challenge

Working On Task #1

Synthetic Table: an additional feature that combines the results of several queries q together.

Example:

[Figure: a Synthetic Table generated by MERGING the answers from died-in(?x,?y), invented(?x,?y), published(?x,?y), and taught(?x,?y).]

Page 39: Structured Querying of Web Text A Technical Challenge

Working On Task #2

Implementing with the Google search engine

[Figure: a search text box with a GO button. The request "list all scientists who died before 1955, and their inventions" is entered as the query below.]

q(?x, ?y):- invented(<scientists> ?x, ?y), died-in(?x, ?z), (?z < 1955)

Page 40: Structured Querying of Web Text A Technical Challenge

Comparing results: ExDB & Google

Test query: list all scientists who created something

[Figure: output from ExDB and output from Google.]

Comments:

On this test query, ExDB performs much better than Google.

For the Google result, after investigating all the links, only 1 document comes close to the answer.

For ExDB, although the output has some redundancy, the answer is still better.

Page 41: Structured Querying of Web Text A Technical Challenge

Conclusion

Only binary predicates are allowed.

The result is in table format (unlike the Google search engine).

The way ExDB gets its answers makes more sense, since it integrates all the data before we make a query on it.

The extractor has to run before users can make a query.

The IE systems involved in this paper are TextRunner, KnowItAll, and DIRT.

Users are not expected to know the table schema; instead, the system itself tries to match as much as it can to answer the query (using synonyms and inclusion dependencies).

Page 42: Structured Querying of Web Text A Technical Challenge

Questions?
