

Page 1: A Study on Query Expansion Methods for Patent Retrieval Walid MagdyGareth Jones Centre for Next Generation Localisation School of Computing Dublin City

A Study on Query Expansion Methods for Patent Retrieval

Walid Magdy, Gareth Jones

Centre for Next Generation Localisation

School of Computing

Dublin City University

24 October 2011


Outline

What is the Problem?

Why Patents?

Current Solutions

Testing Existing Approaches

New Approach

Results

Conclusion

Motivation

Patent Characteristics

Prior Work

Applying Standard QE

Novel Method

Outcome

Findings

Agenda


Why Patents?

Challenging wording

Using vague and general terms

Strange combination of terms

No defined query (what words to select for search?)

Low retrieval effectiveness

Recall-oriented IR task

Hypothesis: QE → better query/doc match → better results


Prior Work

Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)

QE using Rocchio formula: no significant improvement

QE using Taylor formula: no significant improvement

Reweighting query terms using PRF: no significant improvement

Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al, NTCIR-5)

QE for individual claims from same patent topic: significant improvement, but not applicable for other patent search tasks

Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)

Enrich queries to improve the retrievability of patents with low chance of retrieval, but not tested for real patent search task


Testing QE for Prior-Art Patent Search

CLEF-IP 2010: 1.35M patents from the EPO

1.35K English patent topics

Collection contains EN/FR/DE patents, with translations of titles and claims in three languages

Expand query by: PRF vs. WordNet

Baseline (BL): the method of Magdy et al. (2011) without citation extraction (full patent description section as query)

MAP and PRES were used for evaluation

BL: 0.14 MAP, 0.486 PRES
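As a reminder of what the MAP figure measures, here is a minimal sketch of average precision and its mean over topics. The topic and document IDs are made up for illustration; PRES is not covered by this sketch.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all topics in qrels."""
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): two topics, one of which retrieves nothing relevant.
runs = {"T1": ["d3", "d1", "d7", "d2"], "T2": ["d5", "d9"]}
qrels = {"T1": {"d1", "d2"}, "T2": {"d4"}}
map_score = mean_average_precision(runs, qrels)
```

Because each relevant document contributes the precision at its own rank, MAP rewards relevant documents placed near the top much more than those found deep in the list.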


Applying Pseudo Relevance Feedback

PRF implemented in Indri was used

Different numbers of feedback (FB) terms and documents were tested

MAP (BL = 0.1399)

FB docs   10 terms   20 terms   30 terms   50 terms
5         0.037      0.053      0.062      0.072
10        0.031      0.046      0.053      0.061
20        0.026      0.036      0.042      0.049

PRES (BL = 0.486)

FB docs   10 terms   20 terms   30 terms   50 terms
5         0.196      0.234      0.247      0.265
10        0.190      0.222      0.235      0.251
20        0.178      0.205      0.216      0.232
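The two parameters varied in the table can be illustrated with a deliberately simplified feedback loop. This is a sketch under assumptions, not Indri's feedback model: it picks expansion terms by raw frequency in the top-ranked documents, and the function name and toy documents are hypothetical.

```python
from collections import Counter
from typing import List


def prf_expand(query: List[str], ranked_docs: List[List[str]],
               fb_docs: int, fb_terms: int) -> List[str]:
    """Append the fb_terms most frequent new terms from the top fb_docs documents."""
    counts = Counter()
    for doc in ranked_docs[:fb_docs]:
        counts.update(t for t in doc if t not in query)
    # Sort by descending count, then alphabetically so the result is deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:fb_terms]
    return query + [term for term, _ in best]


# Toy initial ranking (hypothetical, already stemmed and tokenised).
query = ["wast", "heat", "stream"]
ranked_docs = [
    ["heat", "exchang", "pipe", "stream"],
    ["heat", "pipe", "boiler"],
    ["engin", "motor"],
]
expanded = prf_expand(query, ranked_docs, fb_docs=2, fb_terms=2)
```

The expansion terms come entirely from the top of the initial ranking, so if those documents are mostly non-relevant the added terms are drawn from off-topic text.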


Using WordNet for Expansion

Expand query terms using synonyms and hyponyms for nouns and verbs

Apply QE to a sample of 100 topics, then apply the best combination to the full set of 1.35K topics

NS = noun synonyms, NH = noun hyponyms, VS = verb synonyms, VH = verb hyponyms

Sample of 100 topics:

Run            MAP      %change   PRES    %change
Baseline       0.1668   NA        0.584   NA
NS             0.1680   +0.7%     0.562   -3.7%
NS+NH          0.1680   +0.7%     0.561   -3.8%
NS+VS          0.1677   +0.5%     0.551   -5.6%
NS+NH+VS+VH    0.1540   -7.6%     0.544   -6.8%

Full topic set:

Run            MAP      %change   PRES    %change
Baseline       0.1399   NA        0.486   NA
WordNet (NS)   0.1364   -2.5%     0.484   -1.0%
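How the run names combine can be sketched as follows. The two small dictionaries are hypothetical stand-ins for WordNet lookups (a real run would query WordNet itself), and the part-of-speech tags on the query terms are assumed to be given.

```python
from typing import List, Tuple

# Hypothetical stand-in for WordNet: (term, part of speech) -> related terms.
TOY_SYNONYMS = {("engine", "n"): ["motor"],
                ("remove", "v"): ["eliminate", "withdraw"]}
TOY_HYPONYMS = {("engine", "n"): ["turbine", "diesel"],
                ("remove", "v"): ["extract"]}


def expand(query: List[Tuple[str, str]], config: str) -> List[str]:
    """Expand (term, pos) pairs; config is a '+'-joined subset of NS, NH, VS, VH."""
    enabled = set(config.split("+")) if config else set()
    out: List[str] = []
    for term, pos in query:
        out.append(term)
        prefix = {"n": "N", "v": "V"}.get(pos)
        if prefix is None:
            continue  # only nouns and verbs are expanded
        if prefix + "S" in enabled:
            out.extend(TOY_SYNONYMS.get((term, pos), []))
        if prefix + "H" in enabled:
            out.extend(TOY_HYPONYMS.get((term, pos), []))
    return out


query = [("engine", "n"), ("remove", "v"), ("hot", "a")]
ns_only = expand(query, "NS")
everything = expand(query, "NS+NH+VS+VH")
```

Even in this toy case the query grows from 3 terms to 4 with NS and to 9 with all four relations enabled, which hints at why the fuller configurations are costly to run.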


Standard QE Approaches

PRF:

Significant degradation in retrieval effectiveness.

This can be expected due to the low initial retrieval precision

WordNet:

Statistically significant degradation of results, but with some successful instances (31% of topics)

Large reduction in retrieval speed, since the average query is at least 5 times larger (34 times larger for NS+NH+VS+VH)

A new effective and efficient QE method is required!


Automatically Generated SynSet

[Diagram: SynSet generation pipeline]

Inputs: English fields; French translations

Steps: Align Sentences; Remove Stopwords; Stem Words; Align Terms; Backoff Alignment

Outputs: EN→FR terms dictionary; FR→EN terms dictionary; EN→EN terms dictionary

Example sentence pair:

EN: process for eliminating foreign matter from a waste heat stream

FR: procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue

After stopword removal and stemming:

EN: process elimin foreign matter wast heat stream

FR: procéd élimin mati étrangèr cour chaleur perdu

Dictionary entries shown in the diagram:

elimin: élimin 0.71, elimin 0.13

élimin: remov 0.71, elimin 0.14

elimin: remov 0.6, elimin 0.16

elimin: remov 0.85, elimin 0.15
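One plausible reading of how an EN→EN dictionary follows from the EN→FR and FR→EN dictionaries is pivoting through French: the score of an English term e' for an English term e is the sum over French terms f of P(f | e) · P(e' | f). The slides do not spell this out, so the sketch below is an assumption, as are the pruning threshold `min_prob` and the toy dictionary entries. It does not model the Backoff Alignment step.

```python
from collections import defaultdict
from typing import Dict

Dictionary = Dict[str, Dict[str, float]]


def pivot(en_fr: Dictionary, fr_en: Dictionary, min_prob: float = 0.1) -> Dictionary:
    """Build EN->EN entries by pivoting through French, then prune and renormalise."""
    en_en: Dictionary = {}
    for e, fr_probs in en_fr.items():
        scores = defaultdict(float)
        for f, p_f in fr_probs.items():
            for e2, p_e2 in fr_en.get(f, {}).items():
                scores[e2] += p_f * p_e2  # sum over pivot terms f
        # Drop weak candidates (assumed threshold), then renormalise to sum to 1.
        kept = {t: s for t, s in scores.items() if s >= min_prob}
        total = sum(kept.values())
        if total > 0:
            en_en[e] = {t: s / total for t, s in kept.items()}
    return en_en


# Toy dictionaries (hypothetical values, stemmed terms).
en_fr = {"motor": {"moteur": 1.0}}
fr_en = {"moteur": {"motor": 0.64, "engin": 0.36}}
en_en = pivot(en_fr, fr_en)
```

Each resulting entry is a probability distribution over English terms, which usually includes the source term itself.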


Samples of the Output

motor: motor 0.64, engin 0.36

weight: weight 0.86, wt 0.14

travel: travel 0.67, move 0.19, displac 0.14

color: color 0.56, colour 0.25, dye 0.19

link: link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12

cloth: fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14

tube: tube 0.88, pipe 0.12

area: area 0.4, zone 0.23, region 0.2, surfac 0.17

game: set 0.6, game 0.4

play: set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1


SynSet QE Results

8M parallel EN/FR sentences were extracted from EPO patent collection to generate SynSets

Two runs were performed:

Expanding the query using SynSet without weights (Usynset)

Using SynSet probabilities as weights for the terms in the query (Wsynset)

Run        MAP      %change   PRES    %change
Baseline   0.1399   NA        0.486   NA
Wsynset    0.1440   +2.9%     0.485   -0.7%
Usynset    0.1402   +0.2%     0.480   -1.7%
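The difference between the two runs can be sketched as below. The two SynSet entries are taken from the sample output shown earlier; the query, the function name and the choice to give unexpanded terms weight 1.0 are assumptions for illustration, and the sketch does not show how the weights are passed to the search engine.

```python
from typing import Dict, List, Tuple

# Two entries copied from the sample SynSet output; the query below is made up.
SYNSET = {
    "motor": {"motor": 0.64, "engin": 0.36},
    "tube": {"tube": 0.88, "pipe": 0.12},
}


def expand_query(query: List[str], synset: Dict[str, Dict[str, float]],
                 weighted: bool) -> List[Tuple[str, float]]:
    """Return (term, weight) pairs; terms without a SynSet entry keep weight 1.0."""
    out: List[Tuple[str, float]] = []
    for term in query:
        for alt, prob in synset.get(term, {term: 1.0}).items():
            # Wsynset keeps the SynSet probability, Usynset flattens it to 1.0.
            out.append((alt, prob if weighted else 1.0))
    return out


query = ["motor", "cool", "tube"]
wsynset = expand_query(query, SYNSET, weighted=True)
usynset = expand_query(query, SYNSET, weighted=False)
```

In the unweighted run a low-probability alternative such as `pipe` counts as much as the original term, whereas the weighted run keeps it as a minor contribution.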


SynSet Expansion

Significantly better MAP, but significantly worse PRES

i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and less recall

Some topics were improved (34% of topics), but some were degraded (39% of topics).

Significantly more efficient than PRF and WordNet (query size is only 60% larger)


A Deeper Look at SynSet

No features with high correlation to SynSet QE success

Initial retrieval quality of BL does not relate to the performance of QE

Topics improved by Wsynset:

Topic ID   Baseline   Wsynset   %change
PAC-1704   0.000      0.174     +∞
PAC-195    0.000      0.215     +∞
PAC-1225   0.105      0.532     +408%
PAC-1670   0.124      0.637     +415%
PAC-954    0.514      0.763     +48%
PAC-122    0.590      0.944     +60%
PAC-579    0.630      0.902     +43%
PAC-1113   0.669      0.880     +32%

Topics degraded by Wsynset:

Topic ID   Baseline   Wsynset   %change
PAC-1510   0.030      0.012     -60%
PAC-210    0.160      0.000     -100%
PAC-220    0.201      0.000     -100%
PAC-56     0.263      0.040     -85%
PAC-784    0.323      0.027     -92%
PAC-42     0.459      0.216     -53%
PAC-906    0.571      0.214     -63%
PAC-1498   0.662      0.307     -54%
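A per-topic comparison of this kind can be reproduced with a few lines. The sketch below uses a handful of rows from the table above; because it works from the rounded scores printed on the slide, recomputed percentages can differ by a point from the slide's figures.

```python
import math
from typing import Dict, Tuple

# (baseline, Wsynset) scores for a few topics from the table above.
TOPICS: Dict[str, Tuple[float, float]] = {
    "PAC-1704": (0.000, 0.174),
    "PAC-1225": (0.105, 0.532),
    "PAC-122": (0.590, 0.944),
    "PAC-210": (0.160, 0.000),
    "PAC-42": (0.459, 0.216),
}


def relative_change(baseline: float, expanded: float) -> float:
    """Relative change in percent; infinite when the baseline score is zero."""
    if baseline == 0.0:
        return math.inf if expanded > 0 else 0.0
    return 100.0 * (expanded - baseline) / baseline


changes = {t: relative_change(b, e) for t, (b, e) in TOPICS.items()}
improved = sorted(t for t, c in changes.items() if c > 0)
degraded = sorted(t for t, c in changes.items() if c < 0)
```

Large gains and large losses occur at both low and high baseline scores, which is consistent with the observation that initial retrieval quality does not predict whether QE helps.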


Conclusions

PRF is not effective with patent prior-art search

WordNet QE for patent search:

Leads to overall significant degradation of retrieval

Has some positive impact on the retrieval of some topics

High computational cost

SynSet QE for patent search:

The most effective and efficient QE technique among those tested

Significant improvement for very high ranks, but significant degradation of overall ranking and recall

No indication of when it fails/succeeds

SynSet can be used as a lexical resource for patent examiners


Future Work

More analysis to better understand when QE fails/succeeds

Applying SynSet on real patent examiners’ queries rather than automatically formulated queries

Combining different QE methods

Alternative methods for query modification, for example query reduction (QR)


Please check the CIKM poster session

Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.

Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.

Thank you