Page 1: IEPAD: Information Extraction based on Pattern Discovery

IEPAD: Information Extraction based on Pattern Discovery

Chia-Hui Chang

National Central University, Taiwan

http://www.csie.ncu.edu.tw/~chia

Page 2: IEPAD: Information Extraction based on Pattern Discovery

2001/5/4 2

Outline

• Introduction
• Problem definition
• Related work
• System architecture
• Extraction rule generation
• Experiments
• Summary and future work

Page 3: IEPAD: Information Extraction based on Pattern Discovery


Introduction

Web information integration
• Multi-search engines, e.g. Metacrawler
• Shopping agents, etc.

Common tasks
• Data collection
• Information extraction

Page 4: IEPAD: Information Extraction based on Pattern Discovery


Information Extraction

Information Extraction (IE)
• Input: HTML pages
• Output: A set of records

Page 5: IEPAD: Information Extraction based on Pattern Discovery


Related Work

Extractor generation
• Hand-coded wrappers by observation
• Machine learning based approach
  • WIEN (Kushmerick), 1997
  • SoftMealy (Hsu), 1998
  • STALKER (Muslea), 1999
• Fully automatic approach
  • Embley et al., 1999
  • Chang et al., 2000

Page 6: IEPAD: Information Extraction based on Pattern Discovery


System Architecture

[Figure: system architecture. Components and data shown: Users, HTML Pages, Rule Generator, Patterns, Pattern Viewer, Extraction Rule, HTML Page, Extractor, Extraction Results]

Page 7: IEPAD: Information Extraction based on Pattern Discovery


Pattern Discovery based IE

Motivation
• The display of multiple records often forms a repeated pattern
• The occurrences of the pattern are spaced regularly and adjacently

Now the problem becomes ...
• Find regular and adjacent repeats in a string

Page 8: IEPAD: Information Extraction based on Pattern Discovery


The Rule Generator

• Token translator
• PAT tree constructor
• Pattern validator
• Rule composer

[Figure: rule generator pipeline. HTML Page → Token Translator → A Token String → PAT Tree Constructor → PAT Trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules]

Page 9: IEPAD: Information Extraction based on Pattern Discovery


1. Web Page Translation

Encoding of HTML source
• Rule 1: Each tag is encoded as a token
• Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)

HTML example:

<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>

Encoded token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
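The two encoding rules can be sketched in a few lines of Python. This is only an illustration, not the IEPAD translator itself: the function name `encode` and the regular expression `TAG_RE` are assumptions, and comments, scripts, and malformed markup are not handled.

```python
import re

# Hypothetical helper: matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def encode(html):
    """Translate HTML into a token list: one token per tag, '_' for text."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Rule 2: non-blank text between two tags becomes the TEXT token.
        if html[pos:match.start()].strip():
            tokens.append("_")
        # Rule 1: each tag becomes a token (attributes are dropped).
        slash, name = match.groups()
        tokens.append("<" + slash + name.upper() + ">")
        pos = match.end()
    # Text after the last tag, if any.
    if html[pos:].strip():
        tokens.append("_")
    return tokens

page = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
print(" ".join(encode(page)))
```

Both records map to the same seven-token sequence, which is what makes the repeat detectable in the later steps.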

Page 10: IEPAD: Information Extraction based on Pattern Discovery


Various Encoding Schemes

Block-level tags
• Headings: H1~H6
• Text containers: P, PRE, BLOCKQUOTE, ADDRESS
• Lists: UL, OL, LI, DL, DIR, MENU
• Others: DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags
• Logical markup: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
• Physical markup: TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
• Special markup: A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Figure 2. Tag classification

Page 11: IEPAD: Information Extraction based on Pattern Discovery


Example of BL Encoding

Encoding scheme = Block-Level Tags
• Rule 1': Only block-level tags are considered; each tag is encoded as a token
• Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)

HTML example:

<dl><dt><b>1.</b><b><a ...>MGI 2.4 - Mouse <em>Genome</em> … </a><dd>The Mouse <b>Genome</b> Informatics (MGI) ..<br><span>URL:www.informatics.jax.org/ </span><br><a ...> …</a><a ...>…</a><img src=…><a ...>…</a>Facts about:<a> …</a></dl>

Encoded token string:

<dl> <dt> _ <dd> _ <br> _ <br> _ </dl>
1 5 9 64 68
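A block-level variant of the encoder might look like the sketch below. It skips text-level tags and merges everything between two kept tags into a single TEXT token. The names `encode_block_level` and `BLOCK_LEVEL` are hypothetical, the sample page is a shortened stand-in for the slide's example, and DT and DD are added to the block-level set as an assumption because the encoded example above keeps them.

```python
import re

TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Block-level tags from the tag classification.
# Assumption: DT and DD are included because the slide's encoded output keeps them.
BLOCK_LEVEL = {
    "H1", "H2", "H3", "H4", "H5", "H6",
    "P", "PRE", "BLOCKQUOTE", "ADDRESS",
    "UL", "OL", "LI", "DL", "DIR", "MENU", "DT", "DD",
    "DIV", "CENTER", "FORM", "HR", "TABLE", "BR",
}

def encode_block_level(html):
    """Keep only block-level tags; text between kept tags becomes one '_'."""
    tokens = []
    pos = 0
    has_text = False
    for match in TAG_RE.finditer(html):
        if html[pos:match.start()].strip():
            has_text = True
        pos = match.end()
        slash, name = match.groups()
        if name.upper() not in BLOCK_LEVEL:
            continue  # text-level tag: ignored, surrounding text is merged
        if has_text:
            tokens.append("_")
            has_text = False
        tokens.append("<" + slash + name.upper() + ">")
    if html[pos:].strip() or has_text:
        tokens.append("_")
    return tokens

sample = ('<dl><dt><b>1.</b><a href="#">MGI 2.4</a><dd>The Mouse <b>Genome</b> '
          'Informatics<br><span>URL: example</span><br><a href="#">more</a></dl>')
print(" ".join(encode_block_level(sample)))
```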

Page 12: IEPAD: Information Extraction based on Pattern Discovery


2. PAT Tree Construction

PAT tree: binary suffix tree
• A Patricia tree constructed over all possible suffix strings of a text

Example

Token codes:

T(<B>)   000
T(</B>)  001
T(<I>)   010
T(</I>)  011
T(<BR>)  100
T(_)     110

Token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

Binary encoding:

000110001010110011100000110001010110011100

Indexing position:

suffix 1   000110001010110011100000110001010110011100$
suffix 2   110001010110011100000110001010110011100$
suffix 3   001010110011100000110001010110011100$
suffix 4   010110011100000110001010110011100$
suffix 5   110011100000110001010110011100$
suffix 6   011100000110001010110011100$
suffix 7   100000110001010110011100$
suffix 8   000110001010110011100$
suffix 9   110001010110011100$
suffix 10  001010110011100$
suffix 11  010110011100$
suffix 12  110011100$
suffix 13  011100$
suffix 14  100$
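The binary encoding and the suffix list can be reproduced with the short sketch below. It only enumerates the suffixes that a PAT tree would index and does not build the tree; the names `CODES`, `binary_string`, and `token_suffixes` are illustrative assumptions.

```python
# Token codes as given on the slide.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
tokens = record * 2  # the Congo and Egypt records encode identically

def binary_string(tokens):
    """Concatenate the fixed-length code of every token."""
    return "".join(CODES[t] for t in tokens)

def token_suffixes(tokens):
    """One suffix per token position, each terminated by '$'."""
    return [binary_string(tokens[i:]) + "$" for i in range(len(tokens))]

suffixes = token_suffixes(tokens)
for number, suffix in enumerate(suffixes, start=1):
    print(f"suffix {number:<2} {suffix}")
```

Suffixes start only at token boundaries, so a 14-token string yields 14 suffixes rather than one per bit.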

Page 13: IEPAD: Information Extraction based on Pattern Discovery


The Constructed PAT Tree

[Figure 3. The PAT tree for the Congo code: a binary tree with internal nodes a to m and leaves indexed by suffix position. Long edge labels abbreviated in the figure: 0110001010110011100, 1010110011100, 01010110011100, 0110011100, 11100]

Page 14: IEPAD: Information Extraction based on Pattern Discovery


Definition of Maximal Repeats

Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \dots, p_k$
• $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$
• $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$
• $\alpha$ is a maximal repeat if it is both left maximal and right maximal
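To make the definition concrete, here is a brute-force sketch that checks every repeated substring for left and right maximality. It is deliberately not the PAT tree method from the slides (it enumerates all substrings, so it is only suitable for short inputs), and the function name `maximal_repeats` and the treatment of the string boundary as a distinct character are assumptions.

```python
from collections import defaultdict

def maximal_repeats(s):
    """Return {substring: [start positions]} for every maximal repeat of s.

    s may be a str or a tuple of tokens. Positions are 0-based.
    """
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue  # not a repeat
        # Assumption: a missing neighbour (string boundary) counts as a
        # character different from all others.
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else None
                 for p in positions}
        if len(left) > 1 and len(right) > 1:
            result[sub] = positions
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats["adc"])
```

In this string "adc" is a maximal repeat, while "ad" is not because every occurrence is followed by "c".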

Page 15: IEPAD: Information Extraction based on Pattern Discovery


Finding Maximal Repeats

Definition:
• Let's call character $S[p_i - 1]$ the left character of suffix $p_i$
• A node $v$ is left diverse if at least two leaves in $v$'s subtree have different left characters

Lemma:
• The path label $\alpha$ of an internal node $v$ in a PAT tree is a maximal repeat if and only if $v$ is left diverse

Page 16: IEPAD: Information Extraction based on Pattern Discovery


3. Pattern Validator

Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \dots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.

Characteristics of a pattern
• Regularity: variance coefficient

$$V(\alpha) = \frac{\mathrm{StdDev}(\{p_{i+1} - p_i \mid 1 \le i < k\})}{\mathrm{Mean}(\{p_{i+1} - p_i \mid 1 \le i < k\})}$$

• Adjacency: density

$$D(\alpha) = \frac{k \cdot |\alpha|}{p_k - p_1 + |\alpha|}$$

Page 17: IEPAD: Information Extraction based on Pattern Discovery


Pattern Validator (Cont.)

Basic screening: for each maximal repeat $\alpha$, compute $V(\alpha)$ and $D(\alpha)$
a) Check the pattern's variance: $V(\alpha) < 0.5$
b) Check the pattern's density: $0.25 < D(\alpha) < 1.5$

[Flowchart: if $V(\alpha) < 0.5$ fails, discard; otherwise, if $0.25 < D(\alpha) < 1.5$ fails, discard; otherwise keep the pattern]
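The basic screening step can be sketched as follows. The sketch assumes the variance coefficient is the standard deviation of the gaps between consecutive occurrences divided by their mean, and that density is k·|α| / (p_k − p_1 + |α|). Whether the original system uses the population or the sample standard deviation is not stated; the population form (`statistics.pstdev`) is an assumption here, as are the function names.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """V: spread of the gaps between consecutive occurrences, relative to their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, length):
    """D: fraction of the spanned region covered by the k occurrences."""
    k = len(positions)
    return k * length / (positions[-1] - positions[0] + length)

def passes_basic_screening(positions, length):
    """Apply the thresholds from the slide: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a repeat needs at least two occurrences
    return (variance_coefficient(positions) < 0.5
            and 0.25 < density(positions, length) < 1.5)

# Pattern "adc" (length 3) at 0-based positions 0, 6, 11, 17.
print(variance_coefficient([0, 6, 11, 17]), density([0, 6, 11, 17], 3))
```

For this example the gaps are 6, 5, 6, so the variance coefficient is small, and the density is 12/20 = 0.6; the pattern passes both checks.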

Page 18: IEPAD: Information Extraction based on Pattern Discovery


4. Rule Composer

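Occurrence partition
• Flexible variance threshold control

Multiple string alignment
• Increases the density of a pattern

[Flowchart: occurrences with $V(\alpha) < 0.5$ go to the density check; the others go to occurrence partition, where partitions with $V(\alpha) < 0.1$ continue to the density check and the rest are discarded. Patterns failing $0.25 < D(\alpha) < 1.5$ are discarded. Patterns with $D(\alpha) < 1$ go to multiple string alignment]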

Page 19: IEPAD: Information Extraction based on Pattern Discovery


Occurrence Partition

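Problem
• Some patterns are divided into several blocks
• Ex: Lycos and Excite, whose patterns have large regularity (variance coefficient) values

Solution
• Clustering of the occurrences of such a pattern

[Flowchart: after clustering, a cluster with $V(\alpha) < 0.1$ proceeds to the density check; otherwise it is discarded]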

Page 20: IEPAD: Information Extraction based on Pattern Discovery


Multiple String Alignment

Problem
• Patterns with density less than 1 can extract only part of the information

Solution
• Align k-1 substrings among the k occurrences
• A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths
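The two-string case mentioned above can be sketched with the standard O(n*m) dynamic program. The unit costs for mismatches and gaps and the function name `align` are assumptions for illustration; the slides do not state which scoring the system uses.

```python
def align(a, b, gap="-"):
    """Global alignment of two strings with unit mismatch and gap costs.

    Returns (aligned_a, aligned_b, cost).
    """
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or mismatch
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), cost[n][m]

aligned_a, aligned_b, total = align("adcwbd", "adcxb")
print(aligned_a)
print(aligned_b)
```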

Page 21: IEPAD: Information Extraction based on Pattern Discovery


Multiple String Alignment (Cont.)

Suppose “adc” is the discovered pattern for the token string “adcwbdadcxbadcxbdadcb”.

If we have the following multiple alignment for the strings “adcwbd”, “adcxb” and “adcxbd”:

a d c w b d
a d c x b -
a d c x b d

The extraction pattern can be generalized as “adc[w|x]b[d|-]”.
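The example can be traced with the sketch below, which cuts the token string at each occurrence of the pattern and then merges the columns of an already aligned set of rows. The alignment itself is taken as given, and the helper names `split_at_occurrences` and `generalize` are hypothetical.

```python
def split_at_occurrences(s, positions):
    """Cut s at each occurrence start, giving k-1 substrings for k occurrences.

    The tail after the last occurrence is not included.
    """
    return [s[a:b] for a, b in zip(positions, positions[1:])]

def generalize(rows):
    """Merge aligned rows column by column into a pattern with alternatives."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # distinct symbols, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

substrings = split_at_occurrences("adcwbdadcxbadcxbdadcb", [0, 6, 11, 17])
print(substrings)
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```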

Page 22: IEPAD: Information Extraction based on Pattern Discovery


Pattern Viewer

• Java-application based GUI
• Web based GUI: http://140.115.155.102/WebIEPAD/

Page 23: IEPAD: Information Extraction based on Pattern Discovery


The Extractor

Matching the pattern against the encoded token string
• Knuth-Morris-Pratt's algorithm
• Boyer-Moore's algorithm

Alternatives in a rule
• Match the longest pattern

What is extracted?
• The whole record
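As an illustration of the matching step, the sketch below runs Knuth-Morris-Pratt search over a token sequence for a fixed pattern without alternatives. It is a textbook version, not the IEPAD extractor; handling of alternatives such as [w|x] is left out, and the function names are assumptions.

```python
def failure_table(pattern):
    """For each prefix of pattern, the length of its longest proper border."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(tokens, pattern):
    """Return the 0-based start positions of every occurrence of pattern."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens currently matched
    for i, token in enumerate(tokens):
        while k > 0 and token != pattern[k]:
            k = table[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # continue, allowing overlapping matches
    return hits

print(kmp_search(list("adcwbdadcxbadcxbdadcb"), list("adc")))
```

Because the search works on any sequence of comparable items, the same function can be applied to a list of tag tokens instead of single characters.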

Page 24: IEPAD: Information Extraction based on Pattern Discovery


Experiment Setup

Fourteen sources: search engines

Performance measures
• Number of patterns
• Retrieval rate and accuracy rate

Parameters
• Encoding scheme
• Threshold control

Page 25: IEPAD: Information Extraction based on Pattern Discovery


# of Patterns Discovered Using Block-Level Encoding

[Figure 5. Number of patterns validated. x-axis: density (0 to 1); y-axis: # of patterns (0 to 14); curves for r=0.25, r=0.5, r=0.75]

On average, there are 117 maximal repeats in our test Web pages

Page 26: IEPAD: Information Extraction based on Pattern Discovery


Translation

Table 2. Size of translated sequences and number of patterns

Encoding Scheme Length of Sequence No. of Patterns

All Tag 1128 7.9

No Physical 873 6.5

No Special 796 5.7

Block-Level 514 4.4

Average page length is 22.7KB

Page 27: IEPAD: Information Extraction based on Pattern Discovery


Accuracy and Retrieval Rate

Table 3. Basic screening (without Rule Composer)

Encoding Scheme    Retrieval Rate    Accuracy Rate    Matching Percentage
All Tag            0.73              0.82             0.60
No Physical        0.82              0.89             0.68
No Special         0.84              0.88             0.70
Block-Level        0.86              0.86             0.78

Table 4. Effect of Advanced Techniques

Method                                              Retrieval Rate    Accuracy Rate    Matching Percentage
Block-level Encoding                                0.86              0.86             0.78
Occurrence Partition                                0.92              0.91             0.85
Occurrence Partition + Multiple String Alignment    0.97              0.94             0.90

Page 28: IEPAD: Information Extraction based on Pattern Discovery


Accuracy and Retrieval Rate

Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate    Accuracy Rate    Matching Percentage
AltaVista        1.00              1.00             0.91
Cora             1.00              1.00             0.97
Excite           1.00              0.97             1.00
Galaxy           1.00              0.95             0.99
Hotbot           0.97              0.86             0.88
Infoseek         0.98              0.94             0.87
Lycos            0.94              0.63             0.94
Magellan         1.00              1.00             0.76
Metacrawler      0.90              0.96             0.78
NorthernLight    0.95              0.96             0.90
Openfind         0.83              0.90             0.66
Savvysearch      1.00              0.95             0.97
Stpt.com         0.99              1.00             0.95
Webcrawler       0.98              0.98             0.98
Average          0.97              0.94             0.90

Page 29: IEPAD: Information Extraction based on Pattern Discovery


Summary

IEPAD: Information Extraction based on Pattern Discovery
• Rule generator
• The extractor
• Pattern viewer

Performance
• 97% retrieval rate and 94% accuracy rate

Page 30: IEPAD: Information Extraction based on Pattern Discovery


Problems

• Guarantees a high retrieval rate rather than a high accuracy rate
• A generalized rule can extract more than the desired data
• Currently only applicable when there are several records in a Web page

Page 31: IEPAD: Information Extraction based on Pattern Discovery


Final

Acknowledgement
• We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code for us.

Reference
• Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.

Page 32: IEPAD: Information Extraction based on Pattern Discovery


Future Work

Interface for choosing a pattern
• http://www.csie.ncu.edu.tw/~chia/webiepad/

Multi-level extraction
• From record boundary extraction to attribute value extraction

Extractors in Java and C++

Page 33: IEPAD: Information Extraction based on Pattern Discovery


Rule Format

level 1 encoding scheme: rule
level 2 encoding scheme: rule for block 1
level 2 encoding scheme: rule for block 2
...
level 2 encoding scheme: rule for block k
level 1 block 1, level 2 block no for attribute 1
level 1 block 1, level 2 block no for attribute 2
...
level 1 block 1, level 2 block no for attribute t

k blocks
t attributes

Page 34: IEPAD: Information Extraction based on Pattern Discovery


Example (cont.)

Line 0: Blocklevel.h, <DL><DT>String<DD>String<BR>String<BR>String<BR>String</DD></DL>
Line 1: Alltag.h, rule for block 1
Line 2: Alltag.h, rule for block 2
...
Line k: Alltag.h, rule for block k
Line k+1: level 1 block no, level 2 block no for attribute 1
Line k+2: level 1 block no, level 2 block no for attribute 2
...
Line k+t: level 1 block no, level 2 block no for attribute t

Demo
ex: 3, 2
ex: 5, all
ex: 5, 1 3

Page 35: IEPAD: Information Extraction based on Pattern Discovery

Congo Example

Page 36: IEPAD: Information Extraction based on Pattern Discovery


Performance Evaluation

Definition:
• A pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a given threshold

Three measures
• Retrieval rate
• Accuracy rate
• Matching percentage

Page 37: IEPAD: Information Extraction based on Pattern Discovery


Illustration

Let G_{i,j} denote the ordered occurrences p_i, p_{i+1}, ..., p_j

S = ∅; i = 1;
For j = 1 to k-1 do
    If R(G_{i,j+1}) > [threshold] then
        If R(G_{i,j}) < m then
            S = S ∪ {G_{i,j}};
        endif
        i = j+1;
    endif
endfor
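A runnable reading of this partition loop is sketched below. Several points are assumptions rather than facts from the slides: R is taken to be the variance coefficient of the gaps within a group, the two thresholds (`split_threshold`, `keep_threshold`) are hypothetical parameters set to 0.1 only because that value appears in the rule composer flowchart, groups with fewer than two occurrences are dropped, and the last open group is flushed after the loop, which the slide's loop does not show.

```python
from statistics import mean, pstdev

def regularity(positions):
    """Assumed R: variance coefficient of the gaps; 0 for fewer than two positions."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0
    return pstdev(gaps) / mean(gaps)

def partition_occurrences(positions, split_threshold=0.1, keep_threshold=0.1):
    """Greedily split ordered occurrences into regularly spaced groups."""
    groups = []
    i = 0

    def keep(group):
        # Assumption: a group needs at least two occurrences to be a repeat.
        if len(group) >= 2 and regularity(group) < keep_threshold:
            groups.append(group)

    for j in range(1, len(positions)):
        # Would adding positions[j] make the current group too irregular?
        if regularity(positions[i:j + 1]) > split_threshold:
            keep(positions[i:j])
            i = j  # start a new group at positions[j]
    keep(positions[i:])  # flush the last group (added in this sketch)
    return groups

print(partition_occurrences([0, 5, 10, 15, 100, 105, 110]))
```

In the example, the jump from 15 to 100 breaks the regular spacing, so the occurrences fall into two groups that are each perfectly regular.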