tuning topical queries through context vocabulary...

29
Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based Approach Carlos M Lorenzetti Ana G Maguitman [email protected] [email protected] Universidad Nacional del Sur Av. L.N. Alem 1253 Bahía Blanca - Argentina Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento Laboratorio de Investigación y Desarrollo en Inteligencia Artificial CONICET AGENCIA

Upload: others

Post on 17-Sep-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment:

a Corpus-Based Approach

Carlos M Lorenzetti Ana G [email protected] [email protected]

Universidad Nacional del SurAv. L.N. Alem 1253Bahía Blanca - Argentina

Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento

Laboratorio de Investigación y Desarrollo en Inteligencia Artificial

CONICET AGENCIA

Page 2: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Introduction

Page 3: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 3

Context–Based SearchJava?

Page 4: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 4

Context–Based SearchJava?

AnimalsAnimals

ComputersComputers

ConsumablesConsumables

EntertainmentEntertainment

GeographyGeography

FloraFlora

ShipsShips

Page 5: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 5

Context–Based Search

ContextContext

ArticlesArticlesNewspapersNewspapers

OthersOthers

Page 6: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 6

Context–Based SearchJava?

ContextContext

ArticlesArticlesNewspapersNewspapers

OthersOthersGeographyGeography

Page 7: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 7

Query tuning

• Step 1: Query• Step 2: Initial set of results• Step 3: Relevance assessment

– SupervisedSupervised feedback– UnsupervisedUnsupervised feedback– SemiSemi--supervisedsupervised feedback

• Step 4: Better representation• Step 5: Revised set of results

Page 8: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 8

Different Role of Terms

• DescriptorsTerms that appear often in documentsrelated to the given topic

What is this topic about?

• DiscriminatorsTerms that appear only in documentsrelated to the given topic

What terms are useful to seek similar information?

Page 9: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 9

Different Role of Terms

• DescriptorsTerms that appear often in documentsrelated to the given topic

What is this topic about?

• DiscriminatorsTerms that appear only in documentsrelated to the given topic

What terms are useful to seek similar information?

RecallRecall

Page 10: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 10

Different Role of Terms

• DescriptorsTerms that appear often in documentsrelated to the given topic

What is this topic about?

• DiscriminatorsTerms that appear only in documentsrelated to the given topic

What terms are useful to seek similar information?

PrecisionPrecision

RecallRecall

Page 11: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Descriptors and DiscriminatorsComputation:an example

Page 12: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 12

Descriptors and Discriminators

JavaJavaLanguageLanguage

AppletsApplets

CodeCode

TopicTopic: Java Virtual : Java Virtual MachineMachine

NetBeansNetBeansComputersComputers

JVMJVM

RubyRuby ProgrammingProgramming

JDKJDK

VirtualVirtual

MachineMachine

Page 13: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 13

Descriptors and Discriminators

JavaJavaLanguageLanguage

AppletsApplets

CodeCode

TopicTopic: Java Virtual : Java Virtual MachineMachine

NetBeansNetBeansComputersComputers

JVMJVM

RubyRuby ProgrammingProgramming

JDKJDK

VirtualVirtual

MachineMachineGoodGood descriptorsdescriptors

Page 14: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 14

Descriptors and Discriminators

JavaJavaLanguageLanguage

AppletsApplets

CodeCode

TopicTopic: Java Virtual : Java Virtual MachineMachine

NetBeansNetBeansComputersComputers

JVMJVM

RubyRuby ProgrammingProgramming

JDKJDK

VirtualVirtual

MachineMachine

GoodGood discriminatorsdiscriminators

Page 15: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 15

Documents Descriptors and Discriminators

ktd ji ],[HNumberNumber ofof occurrencesoccurrences ofof termterm jjin in documentdocument ii

TopicTopic: Java Virtual : Java Virtual MachineMachine

0330

0120

1004

2004

3003

0220

1120

0110

0236

2552

InitialInitialContextContext

HH

(1)(1) espressotec.comespressotec.com(2)(2) netbeans.orgnetbeans.org(3)(3) sun.comsun.com(4)(4) wikitravel.orgwikitravel.org

0jdk

0jvm

0province

0island

0coffee

3programming

1language

1virtual

2machine

4java

(1)(1) (2)(2) (3)(3) (4)(4)

Page 16: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 16

Documents Descriptors

TopicTopic: Java Virtual : Java Virtual MachineMachineInitialInitial ContextContext

0jdk

0jvm

0province

0island

0coffee

3programming

1language

1virtual

2machine

4java

1

02]),[(

],[),(n

k

jiki

jitdH

H

DescriptiveDescriptive powerpower ofof a a termterm in a in a documentdocument

0,000

0,000

0,000

0,000

0,000

0,539

0,180

0,180

0,359

0,718

),( 0 jtd

Page 17: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 17

Documents Discriminators

TopicTopic: Java Virtual : Java Virtual MachineMachineInitialInitial ContextContext

0jdk

0jvm

0province

0island

0coffee

3programming

1language

1virtual

2machine

4java

1

0]),[s(

]),[s(),(m

k

jiki

jidtT

T

H

H

DiscriminatingDiscriminating powerpower ofof a a termterm in a in a documentdocument

0,000

0,000

0,000

0,000

0,000

0,577

0,500

0,577

0,500

0,447

),( 0dti

Page 18: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 18

Documents comparison criteria

1

0)),(),((),(

n

kkjkiji tdtddd Documents

similarityDocumentsDocuments

similaritysimilarity

KK11

KK33KK22

dd22

dd11

CosineCosine similaritysimilarity

Page 19: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 19

Topics DescriptorsTopicTopic: Java Virtual : Java Virtual MachineMachine

InitialInitial ContextContext

0jdk

0jvm

0province

0island

0coffee

3programming

1language

1virtual

2machinemachine

4javajava

1

,0

1

,0

2

),(

)),(),((),( m

ikkki

m

ikkjkki

ji

dd

tdddtd

TermTerm descriptivedescriptive powerpower in a in a topictopic ofof a a documentdocument

0,014

0,032

0,040

0,040

0,055

0,064

0,089

0,124

0,1580,158

0,3850,385

),( 0 jtd

Page 20: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 20

Topics DiscriminatorsTopicTopic: Java Virtual : Java Virtual MachineMachine

InitialInitial ContextContext

0province

0island

0coffee

4java

1language

2machine

3programming

1virtual

0jdkjdk

0jvmjvm

Term Term discriminatingdiscriminating powerpower in a in a topictopic ofof a a documentdocument

1

,0

2 )),(),((),(m

jkkkijkji dtdddt

0,385

0,385

0,385

0,493

0,517

0,524

0,566

0,566

0,8480,848

0,8480,848

),( 0dti

Page 21: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 21

Proposed Algorithm

ContextContextw1

w2

w3

w4

w5

w6

w7

w8

wm-1

wm wm-2

w9

...

Roulette

query 01

query 02

query 03

query n

result 03

result 01

result 02

result n

w 0,5w 0,25...w 0,1

1

2

m

DESCRIPTORSDESCRIPTORSDESCRIPTORS

w 0,4w 0,37...w 0,01

1

2

m

DISCRIMINATORSDISCRIMINATORSDISCRIMINATORS

1

24

3

Terms

Page 22: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Evaluation

Page 23: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 23

EvaluationLocal IndexLocal Index• DMOZ – ODP Project

– ~ 500 Topics– 3rd level of the hierarchy– 100 URLs at least– English language– > 350,000 pages

• Initial Context

• vs. Bo1 & Baseline– Novelty-Driven Similarity– Precision– Semantic Precision

TopTop

HomeHome ScienceScience ArtsArts

CookingCooking FamilyFamily

ChildcareChildcare

11stst levellevel

22ndnd levellevel

33rdrd levellevel

Page 24: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 24

Evaluation – N Similarity

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 20 40 60 80 100 120 140 160 180iteration

nove

lty-d

riven

sim

ilarit

y

Maximum

Average

Minimum

Context update

Top/Computers/Open_Source/Software

Query formulation and retrieval process

[0.5866; 0.6073]0.5970best

[0.0618; 0.0704]0.06611st

95% CIMeanN

Page 25: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 25

Evaluation – N Similarity

[0.0822; 0.0924]0.087Baseline

[0.5866; 0.6073]0.597Incremental

[0.0710; 0.0803]0.075Bo1-DFR

95% CIMeanN

Page 26: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 26

Evaluation – Precision

>>

Incremental (66.96%)Bo1-DFR (24.33%)Baseline (8.7%)

CC[0.2461; 0.2863]0.266Baseline

[0.3325; 0.3764]0.354Incremental

[0.2859; 0.3298]0.307Bo1-DFR

95% CIMeanPrecision

Page 27: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 27

Evaluation – Semantic Precision

>>

Incremental (65.18%)Bo1-DFR (27.90%)Baseline (6.92%)

CC[0.5383; 0.5679]0.553Baseline

[0.6068; 0.6372]0.622Incremental

[0.5750; 0.6066]0.590Bo1-DFR

95% CIMeanPrecisionS

Page 28: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based ApproachLorenzetti Carlos M, Maguitman Ana G

1st QSI - OTM – Monterrey, Mexico 28

Conclusions

• We presented an intelligent IR approach for learning context-specific terms.– Take advantage of the user contextuser context.

• We have shown evaluations and the effectivenesseffectiveness of incremental methods.

Future Work– Investigate different parameters.– Develop methods to learn and adjust parameters.– Run additional tests using other IR metrics.

Page 29: Tuning Topical Queries through Context Vocabulary ...ir.cs.uns.edu.ar/publications/downloadextra/4.pdf · Inteligencia Artificial CONICET AGENCIA. Introduction. Tuning Topical Queries

ThankThank youyou!!

CONICET AGENCIA

Laboratorio de Investigación y Desarrollo en Inteligencia Artificial

lidia.cs.uns.edu.ar

Universidad Nacional del SurBahía Blanca

www.uns.edu.ar