Probabilistic Structured Query Methods

Kareem Darwish, Douglas W. Oard

Electrical and Computer Engineering Department, College of Information Studies and UMIACS

Abstract

Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute a weight for each query term.

This paper reviews prior work on structured query techniques and introduces three variants that leverage estimates of replacement probabilities. Improvements in retrieval effectiveness are demonstrated for cross-language retrieval and for retrieval based on optical character recognition (OCR) when replacement probabilities are used to estimate both term frequency and document frequency.

Introduction

There are many situations in which it’s desirable to match a query term with different terms in a document, such as stemming, thesaurus expansion and cross-language retrieval.

When the mappings among matching terms are known in advance, the usual approach is to conflate the alternatives during indexing.

Query-time implementations are necessary when appropriate matching decisions depend on the nature of the query.

Here, presently known techniques for query-time replacement are reviewed, new techniques that leverage estimates of replacement probabilities are introduced, and experimental results that demonstrate improved retrieval effectiveness in two applications are presented.


Cross-language information retrieval (CLIR) has received more attention than any other query-time replacement problem in recent years.

Query translation research has developed along two broad directions: “dictionary-based” and “corpus-based” techniques.

Term frequency (TF) is a measure of aboutness, which has beneficial effects on both precision and recall.

Document frequency (DF) is a measure of specificity, and its principal effect is on precision.

Replacement Techniques

Pirkola appears to have been the first to try separately estimating TF and DF for query terms in CLIR, using the InQuery synonym operator to implement what he called “structured queries”.

InQuery’s synonym operator was originally designed to support monolingual thesaurus expansion, so it estimates TF and DF as follows:
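The estimates can be reconstructed as below, assuming the usual behavior of a synonym operator: term frequencies of the replacements are summed, and a document counts once toward DF if it contains any replacement. The set notation for DF is an assumption of this sketch.

```latex
TF_j(Q_i) = \sum_{D_k \in T(Q_i)} TF_j(D_k)
\qquad
DF(Q_i) = \Bigl|\, \bigcup_{D_k \in T(Q_i)} \{\, d : D_k \in d \,\} \Bigr|
```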


where Q_i is a query term, D_k is a document term, TF_j(Q_i) is the term frequency of Q_i in document j, and T(Q_i) is the set of known replacements (in CLIR, translations) for the term Q_i.

This represents a very cautious strategy in which a high DF for any replacement will result in a high “joint DF” for that query term.


Kwok was the first to introduce a variant of Pirkola’s method:
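A reconstruction, assuming Kwok's variant keeps the TF estimate unchanged and replaces the union of documents with a sum of per-term document frequencies (an upper bound on the union that needs only per-term statistics):

```latex
DF(Q_i) = \sum_{D_k \in T(Q_i)} DF(D_k)
```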

Another alternative, not previously explored, would be to use the maximum document frequency of any replacement (MDF):
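Written out from the description, with T(Q_i) denoting the set of known replacements for query term Q_i and DF(D_k) the document frequency of a single replacement (the exact notation is an assumption):

```latex
DF(Q_i) = \max_{D_k \in T(Q_i)} DF(D_k)
```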


All three techniques treat every known replacement as equally likely.
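The three unweighted estimates can be compared side by side in a minimal Python sketch. The toy postings, the function names, and the reading of Kwok's variant as a sum of document frequencies are all assumptions made for illustration, not details taken from the slides.

```python
from typing import Dict, List, Set

# Hypothetical toy index: replacement term -> IDs of the documents containing it.
postings: Dict[str, Set[int]] = {
    "bank": {1, 2, 3},
    "shore": {3, 4},
    "embankment": {5},
}


def structured_tf(doc_tf: Dict[str, int], replacements: List[str]) -> int:
    """TF of the query term in one document: sum of the TFs of its replacements."""
    return sum(doc_tf.get(term, 0) for term in replacements)


def df_union(replacements: List[str]) -> int:
    """Synonym-operator style DF: documents containing at least one replacement."""
    docs: Set[int] = set()
    for term in replacements:
        docs |= postings.get(term, set())
    return len(docs)


def df_sum(replacements: List[str]) -> int:
    """Sum of per-term DFs (assumed form of Kwok's variant)."""
    return sum(len(postings.get(term, set())) for term in replacements)


def df_max(replacements: List[str]) -> int:
    """MDF: the largest DF of any single replacement."""
    return max((len(postings.get(term, set())) for term in replacements), default=0)


replacements = ["bank", "shore", "embankment"]
print(df_union(replacements), df_sum(replacements), df_max(replacements))
print(structured_tf({"bank": 2, "shore": 1}, replacements))
```

In this toy data every replacement contributes equally, so a rare replacement such as "embankment" changes the union and sum estimates exactly as much as its postings allow, regardless of how likely it is to be the right replacement.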

This risks a somewhat counterintuitive result: introduction of a translation dictionary with improved coverage of rare translations could actually harm retrieval effectiveness.

This exact situation actually arises often with dictionaries built from aligned corpora using statistical methods.


One way to address this problem would be to use a weighted variant of Kwok’s method:
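A plausible form, assuming each replacement D_k carries a weight wt(D_k) that scales its contribution to both estimates (the symbol wt is introduced here for illustration):

```latex
TF_j(Q_i) = \sum_{D_k \in T(Q_i)} wt(D_k)\, TF_j(D_k)
\qquad
DF(Q_i) = \sum_{D_k \in T(Q_i)} wt(D_k)\, DF(D_k)
```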

For the experiments reported below, the weight is set to the best available estimate of the replacement probabilities.
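A minimal Python sketch of weighted estimates, assuming the weights are replacement probabilities that sum to one. The function names and the toy statistics are hypothetical.

```python
from typing import Dict


def weighted_tf(doc_tf: Dict[str, int], weights: Dict[str, float]) -> float:
    """TF of the query term in one document, each replacement scaled by its weight."""
    return sum(w * doc_tf.get(term, 0) for term, w in weights.items())


def weighted_df(df: Dict[str, int], weights: Dict[str, float]) -> float:
    """DF of the query term, each replacement's DF scaled by its weight."""
    return sum(w * df.get(term, 0) for term, w in weights.items())


# Hypothetical replacement probabilities and collection statistics.
weights = {"bank": 0.5, "shore": 0.25, "embankment": 0.25}
df = {"bank": 3000, "shore": 400, "embankment": 20}
doc_tf = {"bank": 2, "embankment": 4}

print(weighted_tf(doc_tf, weights))
print(weighted_df(df, weights))
```

With weights in place, an unlikely replacement with a very high DF no longer dominates the joint estimate the way it does under the union, sum, or maximum.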


Another way of leveraging information about replacement probabilities would be to simply ignore the least likely replacements.

For the experiments reported below, a greedy method was used, with replacements retained in order of decreasing probability until a preset threshold on the cumulative probability was first exceeded.
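The greedy selection can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the replacement that pushes the cumulative probability past the threshold is itself retained.

```python
from typing import Dict, List


def retain_by_cumulative_probability(probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep replacements in decreasing probability order until the
    cumulative probability first exceeds the threshold."""
    kept: List[str] = []
    cumulative = 0.0
    for term, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(term)
        cumulative += p
        if cumulative > threshold:
            break
    return kept


# Hypothetical replacement probabilities for one query term.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(retain_by_cumulative_probability(probs, 0.4))  # ['a']
print(retain_by_cumulative_probability(probs, 0.7))  # ['a', 'b']
```

A threshold of 1.0 is never exceeded when the probabilities sum to one, so every replacement is kept and the method falls back to using the full replacement set.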


The following combinations were tried:

CLIR

Experiments used the TREC 2002 CLIR track collection, which contains 383,872 articles from the Agence France-Presse (AFP) Arabic newswire, 50 topic descriptions written in English, and associated relevance judgments.

Five translation resources of three types were combined for this application.

The resources were:

1. Two bilingual term lists that were constructed using two Web-based machine translation systems (Tarjim and Al-Misbar). The two lists covered about 15% of the unique Arabic stems in the TREC collection.

2. The Salmone Arabic-to-English dictionary, from which we extracted only the translations. The coverage was about 7%.

3. Two translation probability tables, one for English-to-Arabic and one for Arabic-to-English. The coverage was 29%.

These translation resources were combined in the following manner:

1. All resources that were originally provided as Arabic-to-English were inverted. This process likely introduced some error when inverting the translation probability table.

2. A uniform distribution was used to assign probabilities to the translations obtained from machine translation systems and the Salmone dictionary.

3. A uniform distribution was then assumed over the translation resources containing each English term.
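The last two steps can be sketched in Python under stated assumptions: each resource is already inverted to English-to-Arabic, resources without probabilities get a uniform distribution over their distinct translations, and the per-resource distributions are averaged with equal weight across the resources that contain the English term. The function names and the transliterated entries are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, List

Distribution = Dict[str, float]
Resource = Dict[str, Distribution]


def uniform(translations: List[str]) -> Distribution:
    """Assign equal probability to each distinct translation."""
    unique = list(dict.fromkeys(translations))
    return {t: 1.0 / len(unique) for t in unique}


def combine(resources: List[Resource], english_term: str) -> Distribution:
    """Average the per-resource distributions, weighting equally every
    resource that contains the English term."""
    containing = [r[english_term] for r in resources if english_term in r]
    combined: Dict[str, float] = defaultdict(float)
    for dist in containing:
        for arabic_term, p in dist.items():
            combined[arabic_term] += p / len(containing)
    return dict(combined)


# Hypothetical entries for one English term in three resources.
mt_term_list = {"peace": uniform(["salam"])}
dictionary = {"peace": uniform(["salam", "sulh"])}
probability_table = {"peace": {"salam": 0.75, "sulh": 0.25}}

print(combine([mt_term_list, dictionary, probability_table], "peace"))
```

A term found in only one resource simply inherits that resource's distribution, and a term found in none yields an empty result.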

CLIR Results

Baseline: one-best query translation

OCR-Based Retrieval

Previous approaches to the OCR-based retrieval problem have focused primarily on correcting OCR errors or on fuzzy matching techniques.

Experiments used the Zad collection, which was developed at the University of Maryland.

The collection consists of 2,730 documents extracted from Zad AlMe’ad, a printed book for which an accurately character-coded electronic version (the “clean text”) is also available.


The collection comes with 25 written topic descriptions.

Term replacement probabilities were estimated using a position-sensitive unigram character distortion model trained on 5,000 words of automatically aligned clean and OCR-degraded text from the Zad collection.


Given a clean word with characters C_1..C_i..C_n and the resulting word after OCR degradation D_1..D_j..D_m, three probabilities of edit operations would be modeled after alignment:


1. Start from the clean text and the OCR-degraded text.

2. Align them with SCLITE (using a dynamic programming string alignment algorithm).

3. Backtrace the alignment to identify the three kinds of edit operations.

4. Estimate replacement probabilities.
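The pipeline can be illustrated with a small Python sketch. It substitutes a plain Levenshtein alignment for SCLITE, assumes the three edit operations are substitution, deletion, and insertion, and ignores character position, so it is a simplified stand-in for the position-sensitive model rather than the model itself. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Op = Tuple[str, str, str]  # (operation, clean character, degraded character)


def align_ops(clean: str, degraded: str) -> List[Op]:
    """Align two words by edit distance and backtrace the edit operations."""
    n, m = len(clean), len(degraded)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (clean[i - 1] != degraded[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (clean[i - 1] != degraded[j - 1]):
            # A correct recognition is recorded as a character mapped to itself.
            ops.append(("sub", clean[i - 1], degraded[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", clean[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", degraded[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def estimate(pairs: List[Tuple[str, str]]) -> Tuple[Dict[Tuple[str, str], float], Counter]:
    """Relative-frequency estimates of P(degraded | clean); a deletion is
    stored with an empty degraded character. Insertions are only counted."""
    mapped: Counter = Counter()
    clean_total: Counter = Counter()
    inserted: Counter = Counter()
    for clean, degraded in pairs:
        for op, c, d in align_ops(clean, degraded):
            if op == "ins":
                inserted[d] += 1
            else:
                clean_total[c] += 1
                mapped[(c, d)] += 1
    probs = {(c, d): count / clean_total[c] for (c, d), count in mapped.items()}
    return probs, inserted


probs, inserted = estimate([("abc", "abc"), ("abc", "ac"), ("ac", "abc")])
print(probs[("b", "")], inserted["b"])
```

A position-sensitive version would additionally key the counts on where the character occurs in the word.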

OCR Results

Conclusion

This paper introduced a family of methods for query term replacement that exploit estimates of replacement probabilities.

Inclusion of rare translations in a CLIR application was shown to be problematic for all three methods.

Of the three probabilistic structured query methods, WTF/DF was the winner, yielding both the greatest retrieval effectiveness and the least sensitivity to threshold tuning.

Future Work

Term weight tuning

Other applications

Structured document indexing (e.g. translation-based indexing)
