EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-structured and Structured Data

Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou

Tsinghua University

SIGMOD 2008

2009. 03. 19.

Summarized by Jaehui Park, IDS Lab., Seoul National University

Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Introduction

Keyword search over text documents, XML documents, and relational databases

Graph index

Instead of the traditional inverted index

– The inverted index is effective for unstructured data

– But it is inadequate for complex structural information

EASE (Efficient and Adaptive keyword Search method)

Efficient algorithmic basis for scalable top-k-style processing of large amounts of heterogeneous data

– Employing an adaptive, efficient and novel index

Contributions

Modeling unstructured, semi-structured and structured data as graphs

Effective graph index as opposed to the inverted index

Novel ranking mechanism covering both the DB and IR viewpoints

Extensive performance study

Motivation

Unstructured

Link awareness

– Relevant data may be separated into different pages but linked through hyperlinks

(Semi-) Structured

LCA (lowest common ancestor)

– Connected tree with minimal cost

e.g., Steiner trees

r-Radius Steiner Graph Problem

Meaningful Steiner graphs with acceptable sizes

Several concepts

Centric distance

Radius

r-Radius Steiner graph

– Radius of a Steiner graph cannot be larger than r
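
The slide names centric distance and radius without defining them. A minimal sketch, assuming the usual graph-theoretic reading on an undirected, unweighted graph: the centric distance of a node is the largest of its shortest-path (hop) distances to any other node, and the radius of a graph is the smallest centric distance over its nodes. The function names and the example graph are illustrative, not taken from the paper.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def centric_distance(adj, node):
    """Largest shortest-path distance from node to any other node."""
    dist = bfs_distances(adj, node)
    if len(dist) < len(adj):
        return float("inf")  # some node is unreachable
    return max(dist.values())

def radius(adj):
    """Smallest centric distance over all nodes of the graph."""
    return min(centric_distance(adj, v) for v in adj)

# Illustrative path graph a - b - c - d - e (assumed data).
path = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
end_distance = centric_distance(path, "a")     # farthest node is e
middle_distance = centric_distance(path, "c")  # farthest nodes are a and e
path_radius = radius(path)
```

Under these assumptions the path graph would count as a 2-radius graph, because its middle node reaches every other node within two hops.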

Example

DBLP example

The r-Radius Steiner Graph Problem

Given a graph G and an input keyword query K, the r-Radius Steiner Graph Problem is to find all the r-radius Steiner graphs in G that contain all or a portion of the input keywords in K, ranked by relevance to K.

EASE: An adaptive search method

Inverted indices are not effective for discovering the much richer structural relationships existing in databases with complicated structures [10].

Indexing r-radius Steiner graphs for each keyword combination

– Very expensive

Proposed method

1. Discover r-radius graphs (at indexing time)

2. Extract r-radius Steiner graphs (on the fly)

– By removing non-Steiner nodes

EASE: An adaptive search method

Adjacency Matrix

Extracting r-radius graphs effectively
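
The transcript does not preserve how the adjacency matrix is used here. One plausible reading, sketched below under stated assumptions: put ones on the diagonal of the adjacency matrix M, so that a nonzero entry (i, j) of the r-th power marks node j as lying within r hops of node i. The helper names are hypothetical, and the sketch uses Boolean products on plain lists rather than whatever representation the paper uses.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is 1 if some k links i to j."""
    n = len(a)
    return [
        [int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
        for i in range(n)
    ]

def reach_within(adjacency, r):
    """0/1 matrix whose (i, j) entry is 1 when j is within r hops of i."""
    n = len(adjacency)
    # Assumption: self-loops are added so shorter paths stay marked.
    step = [
        [1 if (i == j or adjacency[i][j]) else 0 for j in range(n)]
        for i in range(n)
    ]
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        power = bool_matmul(power, step)
    return power

def neighbourhood(adjacency, i, r):
    """Indices of the nodes within r hops of node i (including i)."""
    row = reach_within(adjacency, r)[i]
    return [j for j, hit in enumerate(row) if hit]

# Illustrative path graph 0 - 1 - 2 - 3 (assumed data).
M = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
around_0 = neighbourhood(M, 0, 2)
around_1 = neighbourhood(M, 1, 2)
```

The subgraph induced by such a neighbourhood is a candidate r-radius graph centred at node i. Whether it has radius exactly r still has to be checked separately.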

EASE: An adaptive search method

Determining the subgraphs that are r-radius graphs

By Lemma 1.

For efficient retrieval of r-radius graphs

– Graph index

r-radius graphs that contain the query keywords in K

Extracting r-radius Steiner graphs

By Theorem 1.

EASE: An adaptive search method

Computing the Steiner nodes
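
The figure for this slide is not preserved, so the paper's exact procedure (Theorem 1) is not reproduced here. The sketch below uses a simplified stand-in: a node is kept if it contains a query keyword (a "content node") or lies on a shortest path between two content nodes. Everything else is treated as a non-Steiner node and removed. This shortest-path criterion and all names are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def steiner_nodes(adj, content_nodes):
    """Content nodes plus nodes on a shortest path between two of them."""
    dist = {c: bfs_distances(adj, c) for c in content_nodes}
    keep = set(content_nodes)
    for u, v in combinations(content_nodes, 2):
        if v not in dist[u]:
            continue  # u and v are not connected
        for s in adj:
            on_path = (
                s in dist[u]
                and s in dist[v]
                and dist[u][s] + dist[v][s] == dist[u][v]
            )
            if on_path:
                keep.add(s)
    return keep

def induced_subgraph(adj, nodes):
    """Drop every node outside `nodes` together with its edges."""
    return {n: [m for m in adj[n] if m in nodes] for n in adj if n in nodes}

# Illustrative undirected graph (assumed data): a and c hold keywords,
# d and e hang off b and connect no pair of content nodes.
graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b", "e"],
    "e": ["d"],
}
kept = steiner_nodes(graph, ["a", "c"])
steiner_graph = induced_subgraph(graph, kept)
```

In this example the branch d, e is pruned, which matches the slide's intuition of extracting the Steiner graph by removing non-Steiner nodes.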

EASE: An adaptive search method

Maximal r-Radius Graph

Avoid redundancy

– Keep the maximal r-radius graphs in the graph index

Overlapping graphs

Graph partitioning

Avoids a huge storage cost

Only the relevant graph partitions need to be retrieved

Graph similarity

– Bigger overlap -> higher similarity
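
The two ideas on this slide can be sketched on node sets. The code assumes that "maximal" means no other r-radius graph has a node set containing this one, and it uses the Jaccard coefficient as one possible overlap measure. The slide only states that a bigger overlap means a higher similarity, so the exact formula is an assumption, as are the function names and sample data.

```python
def keep_maximal(graphs):
    """Keep only graphs whose node set is not contained in another's.

    graphs maps a graph id to a set of node ids. Exact duplicates are
    collapsed to the first id encountered.
    """
    kept = {}
    for gid, nodes in graphs.items():
        dominated = any(
            nodes < other or (nodes == other and oid in kept)
            for oid, other in graphs.items()
            if oid != gid
        )
        if not dominated:
            kept[gid] = nodes
    return kept

def overlap_similarity(a, b):
    """Jaccard coefficient: shared nodes divided by all nodes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Illustrative node sets (assumed data).
candidates = {
    "g1": {1, 2, 3},
    "g2": {2, 3},      # contained in g1, so not maximal
    "g3": {3, 4, 5},
}
maximal = keep_maximal(candidates)
similarity = overlap_similarity(candidates["g1"], candidates["g3"])
```

A similarity of this kind could serve as the distance signal when grouping overlapping graphs into partitions.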

Summary

1. Obtain adjacency matrix M

2. Compute M^r

3. Extract the maximal r-radius graphs

4. Cluster the graphs with the existing K-means algorithm and partition the graph

5. Construct the graph index to materialize the maximal r-radius graphs
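
The slides do not describe the physical layout of the graph index in step 5. As a rough mental model only, it can be pictured as a lookup table from keywords to the identifiers of the maximal r-radius graphs that contain them. The structure, function names and sample data below are hypothetical and are not the paper's index format.

```python
from collections import defaultdict

def build_graph_index(graph_keywords):
    """Map each keyword to the set of graph ids that contain it."""
    index = defaultdict(set)
    for gid, keywords in graph_keywords.items():
        for kw in keywords:
            index[kw.lower()].add(gid)
    return dict(index)

def candidate_graphs(index, query, require_all=True):
    """Graph ids matching all query keywords, or any of them."""
    postings = [index.get(kw.lower(), set()) for kw in query]
    if not postings:
        return set()
    if require_all:
        return set.intersection(*postings)
    return set.union(*postings)

# Illustrative keyword content of three r-radius graphs (assumed data).
graph_keywords = {
    "g1": ["keyword", "search"],
    "g2": ["search", "XML"],
    "g3": ["graph"],
}
index = build_graph_index(graph_keywords)
all_match = candidate_graphs(index, ["search", "xml"])
any_match = candidate_graphs(index, ["search", "xml"], require_all=False)
```

The `require_all` flag mirrors the problem statement, which allows answers that contain all or only a portion of the input keywords.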

Others

Ranking Functions

TF-IDF based IR-ranking

Structural Compactness-based DB Ranking

– Intuitively, when an r-radius Steiner graph SG is more compact, SG is more likely to be meaningful and relevant.
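
For the IR side, the sketch below scores each candidate graph with the textbook TF-IDF weighting, treating the terms found in a graph as a small document. This is not the paper's scoring function, which the slides do not give. The formula (term frequency times the log of inverse document frequency), the names and the sample data are all assumptions, and the structural compactness score is left out.

```python
import math
from collections import Counter

def tf_idf_scores(documents, query):
    """Score each document by summed TF-IDF of the query terms.

    documents maps an id to a non-empty list of terms.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document

    scores = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        score = 0.0
        for term in query:
            if counts[term]:
                tf = counts[term] / len(terms)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores[doc_id] = score
    return scores

# Illustrative term lists for three candidate graphs (assumed data).
documents = {
    "g1": ["keyword", "search", "graph"],
    "g2": ["search", "xml"],
    "g3": ["graph", "graph", "index"],
}
scores = tf_idf_scores(documents, ["keyword", "search"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

A combined ranking would then weigh such a score against a compactness score, so that small, tightly connected Steiner graphs rank above loose ones with the same keywords.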

Indexing

Experimental study

Datasets: DBLife, DBLP and IMDB

Comparison

Unstructured

– InfoUnit [18]

Semi-structured

– SLCA [28]

Structured

– DPBF [6]

Conclusion

Proposed an efficient and adaptive keyword search method

EASE

– Keyword queries over unstructured, semi-structured and structured data

Examined the issues of indexing and ranking

By taking into account both structural compactness (DB ranking) and TF-IDF-based relevance (IR ranking)

Experimental results show that EASE achieves both high search efficiency and quality for keyword search over heterogeneous data.
