Indexing Mixed Types for Approximate Retrieval

Liang Jin* UC Irvine

Nick Koudas University of Toronto

Chen Li* UC Irvine

Anthony K.H. Tung National University of Singapore

* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586


Queries with Mixed-Type Predicates

Star            Title                                         Year  Genre
Keanu Reeves    The Matrix                                    1999  Sci-Fi
Samuel Jackson  Star Wars: Episode III - Revenge of the Sith  2005  Sci-Fi
Schwarzenegger  The Terminator                                1984  Sci-Fi
Samuel Jackson  Goodfellas                                    1990  Drama
…               …                                             …     …

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

• SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
• Example: edit distance ed(Tom Hanks, Ton Hank) = 2
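The edit-distance example can be checked with a short sketch. It uses the standard Levenshtein dynamic program (insertions, deletions, substitutions, each at cost 1); the function name `edit_distance` is ours, not something defined in the talk.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))           # distances from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete cs
                curr[j - 1] + 1,             # insert ct
                prev[j - 1] + (cs != ct),    # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m -> n, delete s
```

Only two rows of the table are kept at a time, so memory is linear in the length of the second string.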

Why fuzzy predicates?

• Errors in queries
– User doesn’t remember a string exactly
– User types a wrong string

Relation R (Star)   Relation S (Star)
Keanu Reeves        Keanu Reeves
Samuel Jackson      Samuel L. Jackson
Schwarzenegger      Schwarzenegger
Samuel Jackson      Samuel L. Jackson

• Errors in databases
– Data is not clean
– Especially true in data integration and cleansing

Problem Formulation

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

Given: a query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation

Goal: Answer the query efficiently

Rest of the talk

• Motivation: supporting queries with mixed-type predicates

• Our approach: MAT tree
• Construction and maintenance of MAT tree
• Experiments

Assumptions

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND |year - 1980| <= 5;

• One fuzzy string predicate (edit distance)
• One numeric predicate

Query: (Qs, δs, Qn, δn) = ('Schwarrzenger', 2, 1980, 5)

Intuition of MAT (Mixed-attribute-type) Tree

• “2 > 1 + 1”
– One integrated indexing structure is better than two independent indexing structures on two attributes
• Indexing numeric attributes: B-tree or R-tree
• Indexing strings as a tree to support fuzzy predicates?

MAT tree

<1946,1957>
    <1946,1956>  leaf: Spielberg 1946, Hanks 1956
    <1956,1957>  leaf: Gibson 1956, Hanks 1957
<1964,1977>
    <1964,1968>  leaf: Crowe 1964, Robert 1968
    <1974,1977>  leaf: DiCaprio 1974, Roberrts 1977
......

Answering a query (Qs, δs, Qn, δn)

• Traverse the MAT-tree top-down
• At each node, prune the subtree unless both checks pass:
– [Qn - δn, Qn + δn] overlaps the node's numeric range
– minEditDistance(Qs, Tn) <= δs, where Tn is the node's trie
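The traversal can be sketched as below, using the records from the MAT-tree figure. This is a simplified illustration, not the authors' implementation: each node keeps the exact set of strings beneath it as a stand-in for the compressed trie Tn, and the names `Node`, `leaf`, `inner`, and `search` are hypothetical.

```python
from dataclasses import dataclass, field


def edit_distance(s, t):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int                 # numeric range <lo, hi> covering the subtree
    hi: int
    strings: frozenset      # ASSUMPTION: exact string set, not a compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number), leaves only


def leaf(records):
    nums = [n for _, n in records]
    return Node(min(nums), max(nums), frozenset(s for s, _ in records),
                records=list(records))


def inner(children):
    strings = frozenset().union(*(c.strings for c in children))
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                strings, children=list(children))


def min_edit_distance(qs, node):
    """Smallest distance from qs to any string summarized by the node."""
    return min(edit_distance(qs, s) for s in node.strings)


def search(node, qs, ds, qn, dn):
    """Answer (Qs, δs, Qn, δn) top-down, pruning on both attributes."""
    if node.hi < qn - dn or node.lo > qn + dn:
        return []                                # numeric ranges do not overlap
    if min_edit_distance(qs, node) > ds:
        return []                                # no string can be close enough
    if not node.children:                        # leaf: verify each record
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


root = inner([
    inner([leaf([("Spielberg", 1946), ("Hanks", 1956)]),
           leaf([("Gibson", 1956), ("Hanks", 1957)])]),
    inner([leaf([("Crowe", 1964), ("Robert", 1968)]),
           leaf([("DiCaprio", 1974), ("Roberrts", 1977)])]),
])
print(search(root, "Roberts", 1, 1975, 5))  # [('Roberrts', 1977)]
```

In this run the subtree <1946,1957> is discarded by the numeric check alone, and the leaf <1964,1968> likewise, even though "Robert" is within edit distance 1 of the query string.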


Challenge

• How to represent strings so that they fit into a limited space (disk based) and still support fuzzy-predicate pruning

Existing Approaches to Indexing Strings as Trees

• M-tree
– Edit distance defines a metric space
• Q-tree
– Utilizes the q-gram property of strings
– See our paper for details
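As background on the q-gram property, here is a sketch of the well-known count filter: if two strings are within edit distance k, they share at least max(|s|, |t|) - q + 1 - k·q q-grams, counted as multisets without padding. This is not the Q-tree itself, whose details are in the paper; the helper names are ours.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all substrings of length q (no padding characters)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def may_match(s, t, k, q=2):
    """Count filter. False means ed(s, t) > k for certain;
    True only means the pair must still be verified."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    needed = max(len(s), len(t)) - q + 1 - k * q
    return shared >= needed


print(may_match("Tom Hanks", "Ton Hank", k=2))      # True: 5 shared, 4 needed
print(may_match("Tom Hanks", "Keanu Reeves", k=2))  # False: 1 shared, 7 needed
```

The filter never rejects a true match, so it can be used for pruning before the exact edit distance is computed.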

Representing strings as a trie

Strings: aad, abcde, abdfg, beh, ca, cb

[Figure: trie built from these strings, with nodes labeled n2, n3, ..., n17]
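A minimal sketch of building the trie for the six example strings. The class and function names are illustrative only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if one of the stored strings ends here


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(c) for c in node.children.values())


def accepted(node, prefix=""):
    """All strings stored in the trie below this node."""
    found = [prefix] if node.is_end else []
    for ch, child in node.children.items():
        found.extend(accepted(child, prefix + ch))
    return found


strings = ["aad", "abcde", "abdfg", "beh", "ca", "cb"]
root = build_trie(strings)
print(count_nodes(root))       # 17: the root plus 16 character nodes
print(sorted(accepted(root)))  # the six original strings
```

An uncompressed trie accepts exactly the inserted strings, which is the baseline that compression later relaxes.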

Compressing a trie

• Select k representative nodes (centers).

• Each center is in the format of <alphabet,height>.

• A compressed trie represents more strings
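One plausible reading of the <alphabet,height> format, sketched below: a center summarizes its subtree by the set of characters that occur in it and by its number of levels. This reading reproduces entries such as <{c,d,e},3> and <{f,g},2> for the example strings, but the exact rule is our assumption. In particular, the sketch summarizes a whole subtree and ignores that other centers may cover part of it.

```python
def build_trie(strings):
    """Trie as nested dicts: character -> subtree (end markers omitted)."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def summarize(ch, subtree):
    """Return (alphabet, height) for the subtree whose top node is labeled ch."""
    alphabet, height = {ch}, 1
    for c, child in subtree.items():
        child_alphabet, child_height = summarize(c, child)
        alphabet |= child_alphabet
        height = max(height, 1 + child_height)
    return alphabet, height


def center_at(root, path):
    """Summarize the subtree rooted at the node reached by following path."""
    node = root
    for ch in path:
        node = node[ch]
    return summarize(path[-1], node)


trie = build_trie(["aad", "abcde", "abdfg", "beh", "ca", "cb"])
print(center_at(trie, "abc"))   # alphabet {c, d, e}, height 3
print(center_at(trie, "abdf"))  # alphabet {f, g}, height 2
```

The summary takes space proportional to the alphabet rather than to the subtree, which is what makes it fit into a fixed-size node.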

[Figure: the trie before and after compression. Compressed nodes shown: <{b},2>, <{e},1>, <{h},1>, <{a,b,c},2>, <{a},1>, <{a,d},2>, <{b,d},2>, <{f,g},2>, <{c,d,e},3>]

Minimum edit distance between a string and a trie

minEditDistance(Qs, Tn)?
– Convert the trie to an automaton.
– Compute the minimum distance between a string and an automaton [Myers and Miller, 1989].
– Early termination is possible.
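For an uncompressed trie, the minimum edit distance can be sketched without an explicit automaton. The code below is a swapped-in, simpler technique, not the Myers and Miller algorithm cited on the slide: it carries one dynamic-programming row per trie node and stops descending when no entry in the row can beat the best distance found so far. All names are illustrative.

```python
import math


class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if a stored string ends here


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True
    return root


def min_edit_distance(query, root, limit=None):
    """Minimum edit distance between query and any string in the trie.
    With a limit, returns limit + 1 when no string is within the limit."""
    best = math.inf if limit is None else limit + 1
    first_row = list(range(len(query) + 1))   # distances from "" to query[:j]
    if root.is_end:
        best = min(best, first_row[-1])
    stack = [(root, first_row)]
    while stack:
        node, row = stack.pop()
        for ch, child in node.children.items():
            new = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new.append(min(new[j - 1] + 1,                    # skip query char
                               row[j] + 1,                        # skip trie char
                               row[j - 1] + (query[j - 1] != ch)))  # align both
            if child.is_end:
                best = min(best, new[-1])
            if min(new) < best:       # early termination: row minima never shrink
                stack.append((child, new))
    return best


root = build_trie(["aad", "abcde", "abdfg", "beh", "ca", "cb"])
print(min_edit_distance("ac", root))  # 2
```

Passing `limit=δs` turns this into the pruning test used during traversal: a result above the limit means the node cannot contain an answer.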

[Figure: edit graph between the query string “ac” and the automaton; edges are labeled with character pairs such as [a,a], [a,d], [c,a], [c,b], [c,d], [c,*]]

Converting a compressed trie to an automaton

• Each node is a state.
• Each edge becomes a transition between two states.
• A compressed node <Σ, L> is expanded to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.

Example: convert the compressed node <{a,b,c},2> into automaton nodes.


Constructing MAT-tree

• Option 1: insert records one by one.
• Option 2:
– bulk-load records
– construct the MAT-tree bottom-up

Compressing a trie

• Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
• Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining (1) and (2)

Method (1): Reducing # of accepted strings

• Intuition:
– Reducing this # makes the compressed trie more accurate
• Goodness function: # of accepted strings
• Algorithm: “Randomized”
– Randomly select k initial centers
– Randomly select one of the centers
– Randomly select an unselected node
– Swap them if it can improve the goodness function
– Do a certain # of iterations
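The “Randomized” steps can be sketched as a generic swap-based local search. The goodness function is passed in as a parameter because computing the number of accepted strings depends on details not given on the slide; the function name, the `seed` parameter, and the toy goodness function in the usage lines are hypothetical.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Pick k centers from nodes by random swaps.
    goodness(centers) returns a cost to minimize, e.g. # of accepted strings."""
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)            # k random initial centers
    best = goodness(centers)
    for _ in range(iterations):
        chosen = set(centers)
        others = [n for n in nodes if n not in chosen]
        if not others:                        # every node is already a center
            break
        i = rng.randrange(k)                  # a random current center
        candidate = centers[:i] + [rng.choice(others)] + centers[i + 1:]
        score = goodness(candidate)
        if score < best:                      # keep the swap only if it improves
            centers, best = candidate, score
    return centers, best


# Toy usage: nodes are integers and the cost is simply their sum.
centers, cost = randomized_centers(list(range(20)), 3, sum, iterations=200)
print(sorted(centers), cost)
```

Because a swap is accepted only when it lowers the cost, the result can never be worse than the initial random choice, but it may be a local optimum.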

Method (2): Keeping accepted strings clustered

• Intuition:
– Keep the accepted strings similar to the original ones by letting them share common prefixes.
– Place the k centers as close to the root as possible.

• Algorithm: “BreadthFirst”

[Figure: BreadthFirst example on trie nodes n2, ..., n9. Compressed nodes shown: <{a,d},2>, <{b,c,d,e,f,g},4>, <{e,h},2>, <{a},1>, <{b},1>]

Method (3): Combining (1) and (2)

• Intuition:
– Minimize the number of accepted strings and, at the same time, maintain their similarity to the originals.
• Algorithm: “BottomUp”
– Keep shrinking the trie bottom-up until we have k nodes
– Compress a node that minimizes the # of additional strings

Dynamic maintenance

Insertion (s, n)
• Search the index for (s, n). If it’s not in the index, identify the correct leaf node.
• If no overflow:
– Update the “MBR” of the leaf node and of its ancestors, recursively, if necessary.
• If overflow:
– Split the leaf node.
– Construct two compressed tries.
– Cascade the split to the ancestors if necessary.

Deletion and Update are handled similarly


Setting

• Data
– IMDB: 100K movie star records (name and year of birth, YOB)
– Customers: 50K records (name and YOB)
• Test bed
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler
• Results on the two data sets are similar; we report the results for IMDB.

Implemented approaches

• B-tree
• Q-tree
• B-tree & Q-tree
• BQ-tree
• BM-tree
• Sequential scan

“BBQ-tree”?

“2 > 1 + 1”

An integrated indexing structure is better than two separate indexing structures

δs=3, δn=4

Scalability

Effect of numeric threshold δn

Effect of string threshold δs

Dynamic Maintenance: time

Dynamic maintenance: MAT quality

Number of centers

• Increasing the number of centers may not reduce the running time: more pruning power comes at a higher computational cost.
• BottomUp and BreadthFirst (compared to Randomized):
– Their centers are close to the root, so early termination is more likely.

Conclusion

• MAT-tree: an efficient indexing structure for queries with mixed-type predicates

• Can be efficiently constructed and maintained

• Future work: develop a uniform framework to support different kinds of similarity functions

The Flamingo Project: http://www.ics.uci.edu/~flamingo/

Backup Slides

Constructing MAT-tree

• Option 1: inserting records one by one.
• Option 2: bulk-loading data records and constructing the MAT-tree in a bottom-up fashion.
– Records are sorted based on one attribute.
– Fill pages with records until full.
– Calculate the numeric range and the compressed trie for each leaf node.
– Merge leaf nodes into internal nodes recursively according to the desired fanout, until a single root is formed.
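The bulk-loading steps can be sketched as below. To keep the example short, each node stores a plain set of strings where the real structure would store a compressed trie; that substitution, the dict layout, and the parameter names `page_size` and `fanout` are assumptions.

```python
def bulk_load(records, page_size=2, fanout=2):
    """Build a MAT-tree-like structure bottom-up from (string, number) records."""
    if not records:
        raise ValueError("cannot bulk-load an empty record set")
    records = sorted(records, key=lambda r: r[1])      # sort on the numeric attribute
    level = []
    for i in range(0, len(records), page_size):        # fill leaf pages in order
        page = records[i:i + page_size]
        level.append({
            "range": (page[0][1], page[-1][1]),
            "strings": {s for s, _ in page},           # stand-in for a compressed trie
            "records": page,
            "children": [],
        })
    while len(level) > 1:                              # merge until a single root
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append({
                "range": (min(c["range"][0] for c in group),
                          max(c["range"][1] for c in group)),
                "strings": set().union(*(c["strings"] for c in group)),
                "records": [],
                "children": group,
            })
        level = parents
    return level[0]


records = [("Spielberg", 1946), ("Hanks", 1956), ("Gibson", 1956), ("Hanks", 1957),
           ("Crowe", 1964), ("Robert", 1968), ("DiCaprio", 1974), ("Roberrts", 1977)]
root = bulk_load(records)
print(root["range"])                                   # (1946, 1977)
print([c["range"] for c in root["children"]])          # [(1946, 1957), (1964, 1977)]
```

With two records per page and a fanout of two, the eight records yield the same numeric ranges as the MAT-tree figure shown earlier in the talk.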

Example – Customer Service Call Center

Name           SSN           YOB
Jack Lemmon    430-871-8294  1978
Harrison Ford  292-918-2913  1962
Tom Hanks      234-762-1234  1956
Tim Legler     125-457-8654  1870
…              …             …

• Customer calls in
• Issue a fuzzy query: Name LIKE “Tom Hanks” AND YOB CLOSE to 1958
• Return result
• Serve the customer

In this example, the underlying system should be able to support fuzzy queries on both the string and numeric attributes!

Scalability test (IO)
