
ABSTRACT

The purpose of this project is to develop an algorithm to extract information from semi-structured Web pages. Many Web applications that use information retrieval, information extraction, and automatic page adaptation can benefit from this structure. This project presents an automatic, top-down, tag-tree-independent approach to detecting Web content structure. It simulates how a user understands Web layout structure based on visual perception. It also segments the Web page based on the data records, which are the most important information in the whole structure. Compared to other existing techniques, our approach is independent of the underlying document representation such as HTML and works well even when the HTML structure is far different from the layout structure. The current method works on a large set of Web pages. This project uses the VBS algorithm to extract information from most semi-structured Web pages, and it recommends rules that improve the performance of the algorithm.


TABLE OF CONTENTS

Abstract

Table of Contents

List of Figures

List of Tables

1. Background and Rationale

1.1 Document Object Model

1.2 Color Code Model

1.3 Vision Based Page Segmentation

1.4 Visual Based Segmentation

2. Narrative

2.1 Visual Block Extraction

3. System Design

3.1 Analysis

3.2 User Interface

3.3 Vision-based Content Structure for Web Pages

4. Evaluation and Results

5. Future Work

6. Conclusion

Bibliography and References

Appendix A. Websites Testing Results


LIST OF FIGURES

Figure 1.1 A semi-structured Web page from buy.com

Figure 1.2 An example of how VIPS segments a Web page into blocks

Figure 1.3 VB 1-1-1(8)

Figure 1.4 VB 1-1-2(4)

Figure 1.5 A sample web page segmented

Figure 1.5(a) Segmentation of a Web page [Cai 2003]

Figure 1.5(b) An example of Web based content structure [Cai 2003]

Figure 3.1 Overall dataflow

Figure 3.2 A sample input Web page

Figure 3.3 A sample VBS-partitioned Web page

Figure 3.4 User interface with segmented data records

Figure 3.5 A segmented Web page mapped to visual blocks

Figure 3.6 VIPS segments data records as leaf node

Figure 3.7 A Web page analyzed by VIPS which identifies noise as node

Figure 3.8 The drop-down list which acts as noise

Figure 3.9 A Web page segmented by VIPS that identifies noise as data records

Figure 3.10 Complex Visual Structure

Figure 3.11 A Web page analyzed by VIPS which identifies noise as node

Figure 3.12 VBS segments data records

Figure 3.13 A Web page segmented by VBS that ignores noise


LIST OF TABLES

Table 1 Evaluation Results


1. BACKGROUND AND RATIONALE

Today the Web has become the largest information source for many people. Most

information retrieval systems on the Web consider Web pages as the smallest and undividable

units, but a Web page as a whole may not be appropriate to represent a single semantic idea. A

Web page usually contains various contents such as items for navigation, decoration, interaction

and contact information, which are not related to the topic of the Web page. Furthermore, a Web

page often contains multiple topics that are not necessarily relevant to each other. Therefore,

detecting the semantic content structure of a Web page could potentially improve the

performance of Web information retrieval [Cai 2003].

Web pages often contain clutter (such as unnecessary images and extraneous links)

around the body of an article that distracts a user from actual content. These “features” may

include pop-up ads, flashy banner advertisements, unnecessary images, or links scattered around

the screen. Extraction of “useful and relevant” content from Web pages has many applications,

including cell phone and PDA browsing, speech rendering for the visually impaired, and text

summarization [Gupta 2005].

People view a Web page through a Web browser and get a 2-D presentation image, which

provides many visual cues to help distinguish different parts of the page. Examples of these cues

include lines, blank areas, images, font sizes, and colors. For the purpose of easy browsing and

understanding, a closely packed region within a Web page is usually about a single topic. This

observation motivates us to segment a Web page from its visual presentation [Embley 1999].


Figure 1.1 A semi-structured Web page from buy.com

In the sense of human perception, it is always the case that people view a Web page as

different semantic objects rather than a single object. Some research efforts show that users

always expect that certain functional parts of a Web page (e.g., navigational links, an advertisement bar) appear at certain positions of that page. Actually, when a Web page is presented to the user,

the spatial and visual cues can help the user to unconsciously divide the Web page into several

semantic parts. Therefore, it might be possible to automatically segment the Web pages by using

the spatial and visual cues [Cai 2003].

Many Web applications can utilize the semantic content structures of Web pages. For

example, in Web information accessing, to overcome the limitations of browsing and keyword

searching, some researchers have been trying to use database techniques and build wrappers to


structure the Web data. Wrappers are interfaces to data sources that translate data into a common

data model used by the mediator. A wrapper is used to integrate data from different databases

and other data sources by introducing a middleware virtual database called a mediator.

Wrappers facilitate access to Web-based information sources by providing a uniform querying

and data extraction capability.

They support fast and efficient data extraction and are domain independent. In building

wrappers, it is necessary to divide the Web documents into different information chunks. If we

can get a semantic content structure of the Web page, wrappers can be more easily built and

information can be more easily extracted. Moreover, link analysis has received much attention in

recent years. Traditionally different links in a page are treated identically. The basic assumption

of link analysis is that if there is a link between two pages, there is some relationship between the

two whole pages. But in most cases, a link from page A to page B just indicates that there might

be some relationship between some certain part of page A and some certain part of page B

[Ashish 1997].

1.1 Document Object Model

In order to analyze a Web page for content extraction, we pass Web pages through an open-source HTML parser, which creates a Document Object Model (DOM) tree, an approach

also adopted by Chen [Chen 2003].

The DOM is a standard for creating and manipulating in-memory representations of

HTML (and XML) content. By parsing a Web Page’s HTML into a DOM tree, we can not only

extract information from large logical units similar to Semantic Textual Units (STUs) but can

also manipulate smaller units such as specific links within the structure of the DOM tree. In


addition, DOM trees are highly transformable and can be easily used to reconstruct complete

Web pages. Finally, increasing support for the DOM makes our solution widely portable

[Buyukkokten 2001].
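For illustration, a page can be loaded into a DOM tree with a few lines of C#. The sketch below uses HtmlAgilityPack, one widely used open-source HTML parser for .NET; the project does not name the parser it uses, so this library choice is an assumption.

    using System;
    using HtmlAgilityPack; // assumed parser; the project's open-source parser is unnamed

    class DomDemo
    {
        static void Main()
        {
            string html = "<html><body><p>Hello <b>world</b></p></body></html>";

            // Parse the markup into an in-memory DOM tree.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Walk the tree recursively, printing each element node with indentation.
            Print(doc.DocumentNode, 0);
        }

        static void Print(HtmlNode node, int depth)
        {
            if (node.NodeType == HtmlNodeType.Element)
                Console.WriteLine(new string(' ', 2 * depth) + node.Name);
            foreach (HtmlNode child in node.ChildNodes)
                Print(child, depth + 1);
        }
    }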

There is a large body of related work in content identification and information retrieval

that attempts to solve similar problems using various other techniques.

Finn et al. discussed methods for content extraction from “single-article” sources, where content

is presumed to be in a single body [Finn 2001].

The algorithm tokenizes a page into either words or tags; the page is then sectioned into

three contiguous regions, placing boundaries to partition the document such that most tags are

placed into outside regions and word tokens into the center region. This approach works well for

single-body documents. It destroys the structure of the HTML and does not produce good results

for multi-body documents where content is segmented into multiple smaller pieces, common on

Web logs. In order for content of multi-body documents to be successfully extracted, the running

time of the algorithm would become polynomial time with a degree equal to the number of

separate bodies; i.e., extraction of a document containing 8 different bodies would run in O(N^8), where N is the number of tokens in the document.
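The single-body idea can be sketched briefly. The following C# fragment is an illustrative reconstruction, not Finn et al.'s implementation: each token is encoded as tag or word, and the boundary pair (i, j) maximizing the number of tags outside the middle region plus the number of words inside it is selected.

    // Sketch of single-body content detection in the style of Finn et al.
    // isTag[k] is true when token k is an HTML tag, false when it is a word.
    static class FinnSketch
    {
        public static (int Start, int End) FindContentRegion(bool[] isTag)
        {
            int n = isTag.Length;
            int[] tagsBefore = new int[n + 1]; // tagsBefore[k] = tags among tokens 0..k-1
            for (int k = 0; k < n; k++)
                tagsBefore[k + 1] = tagsBefore[k] + (isTag[k] ? 1 : 0);

            int bestScore = int.MinValue, bestI = 0, bestJ = n;
            for (int i = 0; i <= n; i++)
                for (int j = i; j <= n; j++)
                {
                    int tagsOutside = tagsBefore[i] + (tagsBefore[n] - tagsBefore[j]);
                    int wordsInside = (j - i) - (tagsBefore[j] - tagsBefore[i]);
                    int score = tagsOutside + wordsInside;
                    if (score > bestScore) { bestScore = score; bestI = i; bestJ = j; }
                }
            return (bestI, bestJ); // tokens in [bestI, bestJ) form the content region
        }
    }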

Kan et al. similarly used semantic boundaries to

detect the largest body of text on a Web page (by counting the number of words) and classify

that as content [Kan 1998].

This method worked well with simple pages. However, the algorithm produced noisy or inaccurate results when handling multi-body documents, especially those with random advertisement and

image placement. Rahman et al. proposed another technique that used structural analysis,

contextual analysis, and summarization [Rahman 2001].


The structure of an HTML document is first analyzed and then decomposed into smaller

subsections. The content of the individual sections is then extracted and summarized. Contextual

analysis is performed with proximity and HTML structure analysis in addition to “natural language processing involving contextual grammar and vector modeling.” However, this proposal has yet to be implemented [Rahman 2001].

Kaasinen et al. discussed methods to divide a Web page into individual units likened to

cards in a deck. Like STUs, a Web page is divided into a series of hierarchical “cards” that are

placed into a “deck.” This deck of cards is presented to the user one card at a time for easy

browsing. They also suggest a simple conversion of HTML content to WML (Wireless Markup

Language), resulting in the removal of simple information such as images and bitmaps from the

Web page so that scrolling is minimized for small displays. The cards are created by this HTML

to WML conversion proxy. While this reduction has advantages, the method proposed in that

paper shares problems with STUs. The problem with the deck-of-cards model is that it relies on

splitting a page into tiny sections that can then be browsed as windows. However, this means that

it is up to the user to determine on which cards the actual contents are located, and since this

system was used primarily on cell phones, scrolling through the different cards in the entire deck

soon became tedious [Kaasinen 2000].

1.2 Color Code Model

Chen et al. proposed a similar approach to the deck of cards method, except that in their

case the DOM tree is used for organizing and dividing the document. Their system first shows an overview of the desired page. The user can select the portion of the page he/she is truly interested in. When selected, that portion of the page is zoomed into full view. One of the key


insights is that the overview page is actually a collection of semantic blocks that the original

page has been broken up into, each one color coded to show the different blocks to the user. This

provided the user with a table of contents from which the user selects the desired section. While

this is an excellent idea, it still involved the user clicking on the block of choice, and then going

back and forth between the overview and the full view. None of these concepts solved the

problem of automatically extracting just the content, although they do provide simpler means in

which the content can be found. These approaches performed limited analysis of Web pages

themselves and in some cases information was lost in the analysis process. By parsing a Web

page into a DOM tree, we have found that one not only got better results but also had more

control over the exact pieces of information that can be manipulated while extracting content

[Chen 2003].

1.3 Vision Based Page Segmentation

The Vision-based Page Segmentation (VIPS) algorithm aims to extract the semantic

structure of a Web page based on its visual presentation. Such a semantic structure is a tree structure; each node in the tree corresponds to a block. Each node is assigned a Degree of Coherence (DoC) value to indicate how coherent the content of the block is based on visual perception; the bigger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all the suitable blocks from the HTML DOM tree, and then finds the separators between these blocks. Here, separators denote the horizontal or vertical lines in a Web page that visually cross no blocks. Based on these separators, the semantic tree of the Web page is constructed. Thus, a Web page is represented as a set of blocks (leaf nodes of the semantic tree). Compared with DOM based


methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisement, and decoration, is easily removed because it is often placed in certain positions of a page. Contents with different topics are distinguished as separate blocks.

The vision-based content structure of a page is obtained by combining the DOM structure and the visual cues. Block extraction, separator detection, and content structure construction together are regarded as a round. The algorithm is top-down: the Web page is first segmented into several big blocks and the hierarchical structure of this level is recorded; for each big block, the same segmentation process is carried out recursively until we get sufficiently small blocks whose DoC values reach a threshold.
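A minimal C# sketch of this round-based recursion follows. The Block and Separator types and the helper routines are hypothetical stand-ins for the full VIPS steps described in [Cai 2003], not the original implementation.

    using System.Collections.Generic;

    // Hypothetical types standing in for the structures VIPS maintains.
    class Block
    {
        public int DoC;                                  // degree of coherence, 1..10
        public List<Block> Children = new List<Block>();
    }
    class Separator { public int Weight; }

    static class VipsSketch
    {
        // One round = block extraction, separator detection, structure construction;
        // rounds repeat recursively until every leaf satisfies DoC >= PDoC.
        public static void Segment(Block block, int pdoc)
        {
            if (block.DoC >= pdoc) return;               // coherent enough: stop dividing

            List<Block> subBlocks = ExtractBlocks(block);
            List<Separator> separators = DetectSeparators(subBlocks);
            block.Children = ConstructStructure(subBlocks, separators);

            foreach (Block child in block.Children)
                Segment(child, pdoc);
        }

        // Placeholder helpers; the real steps use DOM structure and visual cues.
        static List<Block> ExtractBlocks(Block b) => b.Children;
        static List<Separator> DetectSeparators(List<Block> bs) => new List<Separator>();
        static List<Block> ConstructStructure(List<Block> bs, List<Separator> ss) => bs;
    }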

In VIPS, the data records in each segment are supposed to have the same degree of coherence and the same level of depth. But real Web pages may not have same-level contents at the same depth; e.g., when www.yahoo.com is queried using the keyword “automobile” (see Figure 1.2), the returned results are at different levels. Moreover, VIPS may partition the components of a data record into different neighboring blocks.


Figure 1.2 An example of how VIPS segments a Web page into blocks

In Figure 1.2, the Web page is divided into two blocks VB1-1(4) and VB1-2(10) (VB

stands for “Visual Block”, and “1-1” is the block ID assigned by VIPS. The number inside the

parentheses is the degree of coherence of the block). The DoC is assigned based on the block’s

visual property. VB1-1(4) is further divided into VB1-1-1(8) and VB1-1-2(4) because its DoC (4) is less than the Permitted Degree of Coherence (PDoC) of 10. The block VB1-1-1(8) is further segmented until its DoC reaches 10; this action is performed recursively until the condition DoC ≥ PDoC is satisfied.

VB 1-1-1(8) is shown in Figure 1.3. It mainly has a dropdown menu and a text form for

search query input. Figure 1.4 shows VB1-1-2(4). It consists of data records and external links.


Figure 1.3 VB 1-1-1(8)

Figure 1.4 VB 1-1-2(4)

The basic model of vision-based content structure for Web pages is described as follows. A Web page is represented as a triple Ω = (O, Φ, δ). O = {Ω1, Ω2, ..., ΩN} is a finite set of blocks; these blocks do not overlap. Each block can be recursively viewed as a sub-Web-page associated with a sub-structure induced from the whole page structure. Φ = {φ1, φ2, ..., φT} is a finite set of separators, including horizontal separators and vertical separators. Every separator has a weight indicating its visibility, and all the separators in the same Φ have the same weight. δ is the relationship of every two blocks in O and can be expressed as δ: O × O → Φ ∪ {NULL}. For example, suppose Ωi and Ωj are two objects in O; δ(Ωi, Ωj) ≠ NULL indicates that Ωi and Ωj are exactly separated by the separator δ(Ωi, Ωj), i.e., the two objects are adjacent to each other; otherwise, there are other objects between the two blocks Ωi and Ωj [Cai 2003].

Figure 1.5 shows an example of vision-based content structure for a Web page of Yahoo

Auctions. It illustrates the layout structure and the vision-based content structure of the page.

We can then further construct the sub-content structure for each sub-Web-page.

Figure 1.5 A sample web page segmented


Figure 1.5(a) Segmentation of a Web page [Cai 2003]

Figure 1.5(b) An example of Web based content structure [Cai 2003]

For each visual block, DoC is defined to measure how coherent it is. DoC has the

following properties:

The greater the DoC value, the more consistent the content within the block;

In the hierarchy tree, the DoC of the child is not smaller than that of its parent.

In VIPS algorithm, DoC values are integers ranging from 1 to 10, although alternatively

different ranges (e.g., real numbers, etc.) could be used. We can pre-define PDoC to achieve


different granularities of content structure for different applications. The smaller the PDoC is, the

coarser the content structure would be. For example, in Figure 1.5(a) the visual block VB2_1 may not be further partitioned with an appropriate PDoC. Different applications can use VIPS to segment a Web page to different granularities with a proper PDoC [Robertson 1997].

The vision-based content structure is more likely to provide a semantic partitioning of the

page. Every node of the structure is likely to convey certain semantics [Gupta 2005].

For instance, in Figure 1.5 (a) we can see that VB2_1_1 denotes the category links of

Yahoo! Shopping auctions, and that VB2_2_1 and VB2_2_2 show details of the two different

comics.

1.4 Visual Based Segmentation

In the Vision Based Segmentation (VBS) algorithm, various visual cues, such as position,

font, color, and size are taken into account to achieve a more accurate content structure on the

semantic level. These visual cues are adapted from VIPS algorithm. Not all current segmentation

algorithms can determine the data regions or data record boundaries because they are not

developed for this purpose, but they provide the important semantic partition information of a

Web page. VBS is a top-down algorithm. It first extracts all the suitable nodes from the HTML

DOM tree, and then builds visual blocks from these nodes.

The following observations are made from the analysis of the algorithm over various Web pages.

1. Similar data records are typically presented in one or more contiguous regions of a page, with one major region containing most of the data records and several other minor regions. Although there may be some noise, such as sponsored links or paid commercials, in the middle of a contiguous region, this type of noise usually has a very different visual structure from the data records. In addition, there is usually more than one data record before and after the noise.

2. Similar data records are usually siblings, and a leaf or terminal node is not a data record, because a data record can be further partitioned into more than one sub-block in the block tree. Although there are cases where similar data records have different degrees of coherence (DoC) and are at different depths in the block tree, the depth gap is usually as small as 1.

3. In block trees, a data record is usually self-contained in a subtree and contains at least two different types of blocks.

4. Data records located in different block subtrees (or data regions) usually have different block tree structures. In most cases, there is no need to compare data records that are not siblings.


2. NARRATIVE

The VBS algorithm used in this project works on the DOM structure of a Web page. Visual blocks are created from the DOM structure using visual cues and heuristics. The programming language used to create the interface is C#, and it uses Microsoft .NET 3.6 as the platform. The input of the application is the HTML address of any Web page. The interface segments the Web page into visual blocks that contain data records. The process of visual block extraction is explained in the following section.

2.1 Visual Block Extraction

In this step, we aim to find all appropriate visual blocks contained in the current sub-page. In general, every node in the DOM tree can represent a visual block. However, some “huge” nodes such as <TABLE> and <P> are used only for organization purposes and are not appropriate to represent a single visual block. In these cases, the current node should be further divided and replaced by its children. Due to the flexibility of HTML grammar, many Web pages do not fully obey the W3C HTML specification, so the DOM tree cannot always reflect the true relationships among the different DOM nodes.

For each extracted node that represents a visual block, we judge whether the DOM node can be divided based on the following considerations:

DOM node properties. For example, the HTML tags of this node, the background color

of this node, the size and shape of this block corresponding to this DOM node.

The properties of the children of the DOM node. For example, the HTML tags of

children nodes, background color of children and size of the children. The number of

different kinds of children is also a consideration [Cai 2003].


Based on the W3C HTML 4.01 specification, we classify DOM nodes into two categories, inline nodes and line-break nodes.

Inline node: a DOM node with inline text HTML tags. These tags affect the appearance of text and can be applied to a string of characters without introducing a line break. Such tags include <B>, <BIG>, <EM>, <FONT>, <I>, <STRONG>, <U>, etc.

Line-break node: a node with a tag other than the inline text tags. Based on the appearance of the node in the browser and the properties of its children, we give some definitions:

Valid node: a node that can be seen through the browser. The node’s width and height are

not equal to zero.

Text node: the DOM node corresponding to free text, which does not have an HTML tag.

Virtual text node (recursive definition):

o Inline node with only text node children is a virtual text node.

o Inline node with only text node and virtual text node children is a virtual text

node.

Some important cues that are used to produce heuristic rules in the algorithm are:

Tag cue:

1. Tags such as <HR> are often used to separate different topics from a visual perspective. Hence, we prefer to divide a DOM node if it contains these tags.

2. If an inline node has a child that is a line-break node, we divide this inline node.

Color cue: We prefer to divide a DOM node if its background color is different from one

of its children’s. At the same time, the child node with different background color will

not be divided in this round.


Text cue: If most of the children of a DOM node are text nodes or virtual text nodes, we

prefer not to divide it.

Size cue: We predefine a relative size threshold (the node size compared with the size of

the whole page or sub-page) for different tags (the threshold varies with the DOM nodes

having different HTML tags). If the relative size of the node is smaller than the threshold,

we prefer not to divide the node.

Based on these cues, we can produce heuristic rules to judge if a node should be divided.

If a node should not be divided, a block is extracted.
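To make these rules concrete, here is a hedged C# sketch of how the cues might be combined into a single division test. The node model, the rule ordering, and the thresholds are illustrative assumptions rather than the exact rules of the project.

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical DOM node model for illustrating the division heuristics.
    class DomNode
    {
        public string Tag = "";
        public string BackgroundColor = "";
        public bool IsTextNode;               // free text without an HTML tag
        public bool IsVirtualTextNode;        // inline node with only (virtual) text children
        public bool IsInline;                 // <B>, <EM>, <FONT>, ...
        public double RelativeSize;           // node size relative to the page or sub-page
        public List<DomNode> Children = new List<DomNode>();
    }

    static class DivisionHeuristics
    {
        // Returns true when the node should be divided into its children;
        // otherwise the node itself is extracted as a visual block.
        public static bool ShouldDivide(DomNode node, double sizeThreshold)
        {
            // Tag cue 1: separators such as <HR> indicate distinct topics.
            if (node.Children.Any(c => c.Tag == "HR"))
                return true;

            // Tag cue 2: an inline node with a line-break child is divided.
            if (node.IsInline && node.Children.Any(c => !c.IsInline && !c.IsTextNode))
                return true;

            // Color cue: divide when a child's background color differs.
            if (node.Children.Any(c => c.BackgroundColor != node.BackgroundColor))
                return true;

            // Text cue: mostly (virtual) text children -> keep as one block.
            int textChildren = node.Children.Count(c => c.IsTextNode || c.IsVirtualTextNode);
            if (node.Children.Count > 0 && textChildren * 2 > node.Children.Count)
                return false;

            // Size cue: nodes smaller than the threshold are kept whole.
            if (node.RelativeSize < sizeThreshold)
                return false;

            return true;
        }
    }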


3. SYSTEM DESIGN

3.1 Analysis

This project developed an algorithm to extract the semantic structure of Web pages using the VBS algorithm. It creates a DOM tree, and the Web page is segmented based on the heuristics. Then all the noise, such as advertisements and pop-ups, is removed and the data records are displayed. The data flow is illustrated in Figure 3.1.

Figure 3.1 Overall dataflow (Web pages → DOM tree → VBS → segmented Web pages)

The VBS algorithm partitions the Web page using a set of heuristic rules that exceed the performance of the VIPS algorithm and offer better page segmentation.

The Web page in Figure 3.2 is given as input. A DOM structure is generated and then given as input to the VBS algorithm.



Figure 3.2 A sample input Web page

3.2 User Interface

The user interface shown in Figure 3.3 has two input fields. The first is the address bar where the Web site's address is entered; the other is a numeric field that specifies the text-count threshold for a visual block. Visual blocks with less text than this specified value are considered noise and eliminated. According to the algorithm employed, the interface shows the nodes and branches of the tree created.


Figure 3.3 A sample VBS-partitioned Web page

Figure 3.3 shows the analysis of a Web page. The page contains many data blocks, which contain information about books. Each data block is a collection of an image and various text items that represent price, title, ISBN, etc.

Initially, the whole Web page is partitioned into visual blocks. In Figure 3.3 the phrase VB represents a visual block, and the value in parentheses represents the text count of the respective block. VB(3184) represents the header of the Web page. VB(6587) contains the body of the Web page, including the data records; it is further segmented into the data blocks containing information about the books. VB(1730) represents the footer of the Web page, which mostly contains noise.


Figure 3.4 User interface with segmented data records

Figure 3.4 shows the data records segmented from the Web page. VB(297) is the first data record in the Web page; it consists of an image and text lines that give information about the book. Similarly, VB(116) is a segmented data record.

The visual cues employed in VBS are as follows:

Format tags: "B", "I", "A", "U", "STRONG", "BR", "EM", "CITE", "VAR", "ABBR", "Q"

Special tags: "DFN", "CODE", "SUB", "SUP", "SAMP", "KBD", "ACRONYM", "FONT", "HR"

Text tags: "P", "PRE", "SPAN"

List tags: "UL", "OL", "LI", "DL", "DT", "DD"

Image tags: "IMG", "MAP", "AREA"

Heading tags: "H1", "H2", "H3"

Figure 3.5 shows the visual blocks of a sample Web page that was segmented using the VBS algorithm. The analysis of the HTML code starts at the head of the Web page and traverses toward the bottom of the page. The VBS algorithm starts to build visual blocks from the top of the Web page. It first encounters a table with a single row and two columns. Since the table is labeled as visible and contains visual cues such as text cues, format cues, and list cues, it is segmented as a single visual block named VB(837); it is then further subdivided by the VBS algorithm into two visual blocks named VB(734) and VB(819), where VB(819) represents the table column which contains department links and VB(734) represents the table column which contains data elements as rows. The table rows which contain data records are further subdivided into visual blocks named VB(299), VB(116), and VB(312). The conditions used to determine a visual block from visible elements are explained below.


Figure 3.5 A segmented Web page mapped to visual blocks

This project uses new heuristics in page segmentation to achieve better data extraction. The new heuristics improve on the performance of the VIPS algorithm, producing efficient data extraction results.

The procedure that VBS uses to segment a Web page into visual blocks is as follows (a code sketch follows the list):

1. Start the page examination from the body element and obtain the suitable nodes.

2. Build a tree of elements.

3. Recursively walk through the structure.

4. Define a current visual block that represents the current node.

5. Walk through all visible child elements.

6. For each visible element, try to retrieve a visual block.

7. If a block is found and its text length is not below the supplied threshold value, add it to the current visual block.

8. For each child visual block, perform operations 3 through 7.
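A minimal C# sketch of steps 3 through 8 follows, assuming a hypothetical Element type that exposes visibility, text, and children. This is an illustration of the listed procedure, not the project's source code.

    using System.Collections.Generic;

    // Hypothetical element and block types for illustrating the VBS walk.
    class Element
    {
        public bool IsVisible;
        public string Text = "";
        public List<Element> Children = new List<Element>();
    }

    class VisualBlock
    {
        public Element Node;
        public List<VisualBlock> Children = new List<VisualBlock>();
        public VisualBlock(Element node) { Node = node; }
    }

    static class VbsSketch
    {
        // Steps 3-8: recursively build the visual block for one node.
        public static VisualBlock BuildBlock(Element node, int threshold)
        {
            var current = new VisualBlock(node);                   // step 4
            foreach (Element child in node.Children)               // step 5
            {
                if (!child.IsVisible) continue;
                VisualBlock block = BuildBlock(child, threshold);  // steps 6 and 8
                // Step 7: blocks below the text threshold are treated as noise.
                if (block.Node.Text.Length >= threshold)
                    current.Children.Add(block);
            }
            return current;
        }
    }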

3.3 Vision-based Content Structure for Web Pages

This project identifies the basic object as the leaf node in the DOM tree that cannot be

decomposed any more. It uses the vision-based content structure, where every node, called a

block, is a basic object or a set of basic objects. It is important to note that the nodes in the

vision-based content structure do not necessarily correspond to the nodes in the DOM tree [Tang

1999].

The reasons for incorrect data extraction and unacceptable segmentation of Web pages by VIPS are as follows.

1. VIPS translates data records into terminal nodes (leaf nodes) in the visual block tree; this includes the case where multiple data records end up in a single leaf node. In Figure 3.6, all the data records are translated into one leaf node that cannot be further divided. This heuristic severely limits the capability of segmenting all data records. VB-1-2-2-2-2(11) is the leaf node that has more data records inside it.


Figure 3.6 VIPS segments data records as leaf node

2. VIPS may put wrong attributes in a node (e.g., link length = 0 in a link node) or put incomplete information in a node, which may cause wrong leaf node reduction. For the same reason, a noise node sometimes cannot be identified and removed. Part of the problem also involves Web page designers who do not pay enough attention to specifying information about the elements in tags. Figure 3.7 shows a Web page that is segmented using VIPS; it highlights the node that is empty.


Figure 3.7 A Web page analyzed by VIPS which identifies noise as node

3. It is observed that the major contribution of noise is the noise on the edges of Web pages, most of which consists of drop-down lists, action buttons, or text boxes. Figure 3.8 shows a sample Web page that has a drop-down list at the edge of the node.


Figure 3.8 The drop-down list which acts as noise

4. When identifying data records, the node type having the highest number of occurrences can be a non-data record. A node that occurs frequently may be an ad or a text that is repeated to draw attention, and VIPS mistakes it for a data record. Figure 3.9 shows a Web page segmented by VIPS that identifies noise blocks as data records: because the keyword “books” is found and the blocks are of similar size, they are identified as data records.


Figure 3.9 A Web page segmented by VIPS that identifies noise as data records

5. Data records may scatter through more than two levels in the block tree (a complex visual structure). Figure 3.10 shows a complex visual structure, where C denotes a category, D denotes a data record, and R denotes related content.

Figure 3.10 Complex Visual Structure


6. Occasionally, similar blocks are not data records at all. Figure 3.8 shows that even though the noise blocks are similar, they are not data records.

7. A rare case is that VIPS splits a data record into different visual blocks. Figure 3.11 shows a Web page segmented by VIPS that splits a data record into two data records.

Figure 3.11 A Web page analyzed by VIPS which identifies noise as node

After evaluating the performance of the VIPS algorithm, the following heuristics are proposed to improve its visual segmentation performance:

A data record should be considered as a block by default, which eliminates the case of multiple data records in a single leaf node. This is achieved in the VBS algorithm by assuming all elements are visual blocks and then filtering the elements based on the visual cues. Figure 3.6 shows a Web page segmented by VIPS where multiple


data records are in a single leaf node. Figure 3.12 shows the same Web page segmented by the VBS algorithm, which achieves better segmentation.

Figure 3.12 VBS segments data records

If a node type has a high number of occurrences, it should not be taken as evidence of a data record; a frequently occurring node is more likely a noise component that repeats itself. Nearly every recently built Web page features ads, which contribute to the income of the company or the individual and are embedded in the page repeatedly for maximum visibility; counting occurrences would therefore lead to irregular data segmentation. Hence, in the VBS algorithm the number of occurrences is not used to determine whether a block is a data record. The Web page in Figure 3.8 is segmented using VBS in Figure 3.13, which shows that the noise blocks are not segmented as data records and the whole noise block is considered a leaf node.


Figure 3.13 A Web page segmented by VBS that ignores noise

Data records should not be translated into terminal nodes; instead, a separator should be made the terminal node, which will accurately depict the visual blocks that include the data records. The VBS algorithm does not implement separators, so this suggestion is for algorithms that achieve segmentation using separators.


4. EVALUATION AND RESULTS

In this experiment, we used three data sets to compare the performance of our VBS algorithm to the VIPS algorithm. The three data sets come from different sources. The first data set (Data 1) is the Dataset 3 used by ViNTs, downloaded from http://www.data.binghamton.edu:8080/vints/testbed.html. The second data set (Data 2) is downloaded from the manually labeled Testbed for Information Extraction from the Deep Web, TBDW ver. 1.02. TBDW holds query results from 51 search engines, with five query result pages for each search engine; we collect only the first result page (1.html) of each search engine. It is available at http://daisen.cc.kyushuu.ac.jp/TBDW/. We gathered the third data set (Data 3) from the home pages listed in the MDR paper (the MDR paper does not provide the URLs of the real data it tested) [Zhai 2005].

The number of Web pages for each of the three data sets is shown in the first row of Table 1. The performance measures we use to compare the two methods are recall = Ec/Nt and precision = Ec/Et, where Ec is the total number of correctly extracted data records, Et is the total number of records extracted, and Nt is the total number of data records contained in all the Web pages of a data set.
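As a check on these definitions, take the Data 1 VBS column of Table 1: with Ec = 670 correctly extracted records, Et = 916 extracted records, and Nt = 833 actual records, recall = Ec/Nt = 670/833 ≈ 80.4% and precision = Ec/Et = 670/916 ≈ 73.14%, matching the table.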


Table 1 Evaluation Results

                    Data 1          Data 2          Data 3
                 VIPS    VBS     VIPS    VBS     VIPS    VBS
# Web pages        41      41      46      46      50      50
# DRs             833     833    1004    1004    1086    1086
# Extracted DRs  1068     916    1130     962    1168     996
# Correct DRs     686     670     609     630     752     735
Recall (%)      82.35    80.4    60.6    62.7    64.3    67.6
Precision (%)   64.23   73.14   53.89    65.4    69.3    73.8

Table 1 shows the values of recall and precision achieved by the VIPS and VBS algorithms. VBS achieves higher precision on all three data sets and higher recall on Data 2 and Data 3, while VIPS has slightly higher recall on Data 1.

The evaluation is based on the number of data records segmented by both algorithms and the correctness of the DOM tree generated by them. An additional comment about their performance has also been recorded. The PDoC value for the VIPS algorithm was set to 10 for evaluation purposes.

The correctness of a data record is based on the following rules:

1. A data record is correctly extracted if it contains everything belonging to it and nothing else. If some part of the data record is missing or the data record contains irrelevant content (e.g., a part of another data record), the data record is incorrectly extracted. Nested data records are therefore considered incorrect in our experiment.


2. Suggested search results, the most popular search results, and sponsored links, which are often listed at the top of the result page, are not counted because they can usually be found in the full results list or are irrelevant to the query. In addition, banners, advertisement-like item images, and item categories are not considered data records.

3. Data records may come from multiple data regions in a result page rather than just from

one major region.


5. FUTURE WORK

The visual-based algorithm can be further improved by adding more heuristics that correspond to the newer HTML constructs in recent use, such as embedded Flash players and streaming videos, which also hold a considerable amount of data. The performance of the algorithm can be further improved by including another panel in the user interface that indicates the characteristics of a block and its elements; this would come in handy when determining whether a block is noise or a data record with a large amount of data.

More emphasis on block segmentation will also contribute to accurate data extraction. Modern Web developers use technologies such as AJAX, JavaScript, and Adobe Flex, which produce layouts different from the traditional HTML layout.


6. CONCLUSION

An automatic top-down, tag-tree independent and scalable algorithm to detect Web

content structure is presented. It simulates how a user understands the layout structure of a Web

page based on its visual representation. Compared with traditional DOM based segmentation methods, our scheme utilizes useful visual cues from the VIPS algorithm to obtain a better partition of a page at the semantic level. It is also independent of physical realization and works well even when the physical structure is far different from the visual presentation.

The produced Web content structure is very helpful for applications such as Web

adaptation, information retrieval, and information extraction. By identifying the logical relationships of Web content based on visual layout information, the Web content structure can effectively represent the semantic structure of the Web page.

Using the proposed rules, the visual segmentation capability of VBS exceeds that of VIPS. It also reduces noise in complex data structures.


BIBLIOGRAPHY AND REFERENCES

[Ashish 1997] Ashish, N. and Knoblock, C. A., “Semi-Automatic Wrapper Generation for Internet Information Sources,” in Proceedings of the Conference on Cooperative Information Systems, 1997, pp. 160-169.

[Buyukkokten 2001] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., “Accordion summarization for end-game browsing on PDAs and cellular phones,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI’01), 2001.

[Buyukkokten 2001b] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., “Seeing the whole in parts: text summarization for Web browsing on handheld devices,” in Proceedings of the 10th International World-Wide Web Conference, 2001.

[Cai 2003] Cai, D., “VIPS: a Vision-based Page Segmentation Algorithm,” Microsoft Technical Report MSR-TR-2003-79, 2003.

[Chen 2003] Chen, Y., Ma, W. Y., and Zhang, H. J., “Detecting Web page structure for adaptive viewing on small form factor devices,” in Proceedings of WWW’03, Budapest, Hungary, May 2003.

[Embley 1999] Embley, D. W., Jiang, Y., and Ng, Y.-K., “Record-boundary discovery in Web documents,” in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999, pp. 467-478.

[Finn 2001] Finn, A., Kushmerick, N., and Smyth, B., “Fact or fiction: content classification for digital libraries,” in Proceedings of the Joint DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries (Dublin), 2001.

[Gupta 2005] Gupta, S. and Kaiser, G., “Automating Content Extraction of HTML Documents,” World Wide Web: Internet and Web Information Systems, 8, 179-224, 2005.

[Kaasinen 2000] Kaasinen, E., Aaltonen, M., Kolari, J., Melakoski, S., and Laakko, T., “Two approaches to bringing Internet services to WAP devices,” in Proceedings of the 9th International World-Wide Web Conference, 2000.

[Kan 1998] Kan, M.-Y., Klavans, J. L., and McKeown, K. R., “Linear segmentation and segment relevance,” in Proceedings of the 6th International Workshop of Very Large Corpora (WVLC-6), 1998.

[McKeown 2001] McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Kan, M. Y., Schiffman, B., and Teufel, S., “Columbia multi-document summarization: approach and evaluation,” in Proceedings of the Document Understanding Conference, 2001.

[Rahman 2001] Rahman, A. F. R., Alam, H., and Hartono, R., “Content extraction from HTML documents,” in Proceedings of the 1st International Workshop on Web Document Analysis (WDA2001), 2001.

[Robertson 1997] Robertson, S. E., “Overview of the Okapi projects,” Journal of Documentation, Vol. 53, No. 1, 1997, pp. 3-7.

[Tang 1999] Tang, Y. Y., Cheriet, M., Liu, J., Said, J. N., and Suen, C. Y., “Document Analysis and Recognition by Computers,” in Handbook of Pattern Recognition and Computer Vision, edited by C. H. Chen, L. F. Pau, and P. S. P. Wang, World Scientific Publishing Company, 1999.

[Zhai 2005] Zhai, Y. and Liu, B., “Web Data Extraction Based on Partial Tree Alignment,” in Proceedings of WWW 2005, Chiba, Japan, May 10-14, 2005.

APPENDIX A. WEBSITES TESTING RESULTS

* For a detailed description, please refer to Section 4: Evaluation and Results.



Dataset 1 (MDR_DATA)

Web page | VIPS records found | VBS records found | Comments
Advanced Travel Portal | 2 | 2 | VIPS doesn’t detect the tool bar in the individual records
Amazon top sellers | 25 | 25 |
Asia travels | 220 | 220 | Both detect the same nodes in a different order
Barnes and Nobles | 10 | 10 |
Bookpool | 10 | 10 |
Buy HP | 0 | 3 | VIPS cannot display the web page
Buy Products Online | 15 | 15 |
Codys Books | 20 | 20 |
Comp Usa | 8 | 8 | VBS segmented records as rows
Computers-Mama | 15 | 15 |
Costarica tours | 61 | 61 |
Discount cheap software | 14 | 14 |
Ebay Plasma | 50 | 50 |
Find video games | 15 | 15 |
Fragrance Cosmetics | 9 | 9 |
Gaming | 8 | 8 |
Gifts under 25 | 21 | 20 | VIPS finds a blank space as a record, which is not desirable
GPS Navigation | 6 | 6 |
Kadys Books | 20 | 20 |
Kids footlocker | 19 | 19 |
Kodak Easy Share | 9 | 9 | Same functionality
Low cost Domain | 6 | 6 |
Mapquest | 0 | 0 | No records available
Lycos search | 0 | 0 | Script would not allow analysis
New Egg | 12 | 12 |
Overstock | 24 | 24 |
Product List | 24 | 24 | Similar analysis
Radioshack | 1 | 1 | Treating all the data records as one record
Sos store | 22 | 22 |
Shop lycos | 9 | 9 |
Software outlet | 12 | 12 |
Summer Jobs | 12 | 12 | VIPS segmentation not accurate because of the large amount of text involved
U BID | 17 | 17 |
Waffles | 10 | 10 |
Eqarl The EMU | 5 | 5 |
Welcome to Streets | 35 | 35 |
World wide airport | 0 | 0 | No data records in the page, but similar data blocks are recognized
Yahoo Auctions | 15 | 15 |


Dataset 2 Testing Results (TBDW_Testbed)

Web page | VIPS | VBS | Comments
1 | 20 | 20 |
2 | 12 | 10 | VIPS detects a link as text, causing neighboring edge noise to be detected as data records; VBS detects them accurately
3 | 10 | 20 | In VBS all the data records are under the same parent, whereas VIPS has the following errors. Missing 1: complicated situation, just one data record under its parent. Missing 5: actually under the incorrect data record 1-1-2-1-1-2-2-2. Missing 4: actually under the incorrect data record 1-1-2-1-1-2-2-3. Missing 1: complicated situation, just one data record under its parent
4 | 25 | 25 | VIPS splits data records into different sub-trees
5 | 25 | 25 | In VIPS all data record nodes are links; after reduction only a leaf node is left, so no comparison. In VBS the links can be further analyzed
6 | 5 | 5 |
7 | 2 | 2 | VIPS detects a link as text, causing neighboring edge noise to be detected as data records
8 | 10 | 10 |
9 | 10 | 10 | Wrong node type from VIPS. Complex tree
10 | 74 | 74 | VIPS segments all the data as leaves under a single node; VBS further segments the node
11 | 1 | 1 | Only 1 data record
12 | 10 | 10 |
13 | 16 | 16 |
14 | 10 | 10 | VIPS segments all the data as leaves under a single node; in VBS each data record is a different node
15 | 12 | 12 | Wrong node type from VIPS
16 | 15 | 15 |
17 | 160 | 160 | Wrong node type from VIPS; tree is wrong. VBS tree structure shows all nodes on the same level
18 | 8 | 8 | Wrong node type from VIPS; tree is wrong. VBS tree structure shows all nodes on the same level
19 | 50 | 50 | Wrong node type from VIPS; tree is wrong
20 | 10 | 10 |
21 | 100 | 100 | VIPS puts all the data in a single leaf; VBS partitions all the records
22 | 10 | 10 | Wrong node type from VIPS
23 | 10 | | Please double check the page; I think there are > 10 DRs
24 | 10 | | Wrong node type from VIPS
25 | 25 | | Most data record nodes are links; after reduction only a leaf node is left, so no comparison
26 | 10 | 10 |
27 | 12 | 12 | All similar blocks are treated as data blocks in VIPS
28 | 5 | 10 | VIPS doesn’t detect blocks 6-10
29 | 94 | 94 |
30 | 0 | 0 | Data records cannot be extracted because all the records are at the same level under a single node
31 | 2 | 10 | All similar blocks are treated as data blocks in VIPS
32 | 5 | 5 |
33 | 10 | 10 | Wrong node type from VIPS; tree is wrong
34 | 10 | 10 | Wrong node type from VIPS; tree is wrong
35 | 9 | 10 | First 3 data records found by VIPS are very different
36 | 10 | 10 |
37 | 25 | 25 |
38 | 10 | 10 |
39 | 6 | 6 |
40 | 10 | 10 | Wrong node type from VIPS; tree is wrong
41 | 20 | 20 | Wrong node type from VIPS; tree is wrong
42 | 10 | 10 |
43 | 0 | 25 | Data records are leaves, not internal nodes
44 | 2 | 20 | All data records are similar to noise
45 | 10 | 10 |
46 | 15 | 15 |
47 | 1 | 1 |
48 | 10 | 10 |
49 | 13 | 13 |
50 | 10 | 10 | VIPS puts all data records in a single leaf
51 | 14 | 14 | Large amount of noise undetected in VIPS


Dataset 3 (VINTS_DATA)

Web page | VIPS | VBS | Comments
Agents | 36 | 38 | VIPS treats each data record as a leaf
Alphabets | 10 | 10 | Not enough information from VIPS
Alpha works | 9 | 9 | Each data record itself is a leaf node
Amazoid | 6 | 6 | VIPS did not treat a data record less than one internal node
Amazon | 50 | 50 |
AW | 19 | 19 |
Barnes | 9 | 10 | Incorrect 2: noise on the edge. Missing 17: nearly each level has one data record
Book Buyer | 28 | 28 |
Bookpool | 25 | 25 | Missing 1: 1-1-3-1-2-2 + 1-1-3-1-2-3; it is on a different level from the other 24 data records, which range from 1-1-3-2 to 1-1-3-25
Borders | 50 | 50 | Not enough information from VIPS
Canoe2 | 9 | 9 | VIPS treats each data record as several link-type leaf nodes
Canoe | 20 | 20 | The similarity threshold is not high enough
Cbc customers | 7 | 7 |
Chapters | 20 | 20 | VIPS treats all 20 data records as a leaf
Cnet2 | 15 | 15 |
Cnet games | 5 | 5 | The real data records are actually 1-2-2-2-1-1-1-4 and 1-2-2-2-1-1-1-6. Missing 3: each of these 3 data records is itself a link-type leaf node
Cnet tech | 15 | 15 | VIPS did not mark the data records on the same level
Cody | 20 | 20 |
Dwjava | 20 | 20 | Missing 1: 1-2-2-4-1-2; it is under a different parent node from the other data records, and under its parent there is only that one data record (the vsdr complicated situation). Incorrect 6: leaf node reduction makes dissimilar data records seem similar
Dwxml | 16 | 20 |
Ebay | 3 | 3 | Not enough information from VIPS
Etoys | 9 | 9 | VIPS did not mark right, so the big link noise cannot be deleted
Excite | 0 | 15 | Incorrect 2: noise on the edge. Missing 10: these 10 data records are actually under the incorrect 2 data records, 1-2-2-3-2-2-1 and 1-2-2-3-2-2-2, 5 under each; because both have leaf nodes, they are marked as data records. Missing 5: each data record is itself a leaf node
Fat brain | 0 | 25 | VIPS marks link type as text type, and then vsdr uses leaf reduction. Incorrect 5: noise on the edge; the VIPS marking is not detailed enough and makes 1-1 and 1-2 similar
Gamecenter | 35 | 35 | Incorrect 1: leaf reduction makes dissimilar records seem similar
gamelan | 2 | 10 | Missing 10: VIPS marks link type as text type and then vsdr uses leaf reduction. Incorrect 2: noise on the edge
Google | 0 | 10 | Missing 10: VIPS marks link type as text type and treats each data record as one leaf. Incorrect 10: noise on the edge
Goto | 40 | 40 | VIPS does not mark right, and the similarity threshold is not high enough
Hotbot | 10 | 10 |
Ibm | 4 | 4 | Invalid characters
Infoseek | 10 | 10 |
Itn | 0 | 10 | Missing 19: VIPS marks link type as text type
King | 0 | 19 |
Lc | 20 | 20 |
Lycos | 13 | 16 | Missing 4: VIPS does not even generate corresponding IDs for these four data records. Missing 10: VIPS marks link type as text type, and then vsdr using leaf reduction turns data records into text leaves. Missing 3: each of these 3 is actually a combination of two of the incorrect 6: 1-5-2-1-1-1-1 + 1-5-2-1-1-1-2 is a data record, 1-5-2-1-1-2-1 + 1-5-2-1-1-2-2 is a data record, and 1-5-2-1-1-3-1 + 1-5-2-1-1-3-2 is a data record. Incorrect 4: noise on the edge
Magazine outlet | 12 | 12 |
msn | 0 | 50 | Missing 10: VIPS marks other types as text type
Powells | 7 | 8 | Incorrect 4: noise on the edge. Missing 1: 1-2-2-2-2-2; it is under a different parent node from the other data records, and under its parent there is only that one data record (the vsdr complicated situation)
Quote | 10 | 10 |
Rubylane | 25 | 25 | VIPS does not give enough information
Signpost | 12 | 12 | VIPS does not give enough information
Thestar | 50 | 50 |
Vancouverson | 0 | 4 | VIPS treats each data record as a leaf node
Vunet | 0 | 10 | VIPS marks link type as text type. Incorrect 2: noise on the edge
Wine | 10 | 10 |
Yahoo2 | 4 | 20 | Missing 16: VIPS marks link type as text type, and then vsdr using leaf reduction turns data records into text leaves. Incorrect 5: noise on the edge
Yahoo | 0 | 17 | Most data records are leaves; others are reduced in leaf node reduction
Yahoo Auction | 0 | 50 |
Zbooks | 28 | 30 | Missing 2: VIPS treats each of them as a leaf node. Incorrect 1: VIPS does not give enough information; VBS detects all the data records
Zshop | 50 | 50 |