Querying Web Pages with Database Query Languages
by
Xiaoyu Yang
Graduate Program in Computer Science
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
Faculty of Graduate Studies
The University of Western Ontario
London, Ontario
November 1998
© Xiaoyu Yang 1998
National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
As the World Wide Web is growing at a phenomenal rate, it becomes more and more
difficult to retrieve information of interest from the enormous number of resources that are
available. Currently, there are two ways to retrieve information from the Web, namely,
navigation/browsing and searching by search engines. However, these search methods
have significant limitations, such as the "lost-in-hyperspace" phenomenon, the ignorance
of the hypertext structure, etc. These drawbacks motivated the development of a flexible
and powerful web query system.
This thesis presents a prototype system developed to query the Web with database
query languages. In our prototype system, the Web is modeled as a labeled directed graph
which can be stored in a relational database. A parser was designed and implemented in
our prototype system to extract the information of a web page from the source HTML file
and store it into the database. Three query facilities are developed in the prototype system,
namely, the content query, the structure query and the advanced query, which can be used
to pose queries on both the content and the hypertext structure of web pages. Extensive
experiments have been performed to test the prototype system. The testing results show
that database query languages can be used successfully in querying the Web.
ACKNOWLEDGMENTS
I wish to express my most sincere gratitude and appreciation to my supervisor, Dr.
Sylvia Osborn, for her time, invaluable guidance, support, understanding and
encouragement during the course of this work.
Thanks are also extended to all my friends and fellow graduate students for their
suggestions, encouragement and for the friendly environment they provided.
Most of all, I would like to thank my parents and my husband. Their love and support
are invaluable.
TABLE OF CONTENTS
CERTIFICATE OF EXAMINATION
ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Organization
Chapter 2 An Overview of Hypertext and the World Wide Web
2.1 An Introduction to Hypertext
2.1.1 A Brief History of Hypertext
2.1.2 Hypertext Concepts
2.2 The World Wide Web
2.2.1 A Brief History of the World Wide Web
2.2.2 The World Wide Web Concepts
2.3 HyperText Markup Language (HTML)
2.3.1 A Brief History of HTML
2.3.2 Common HTML Tags
2.3.3 Examples
2.4 Searching the World Wide Web
2.4.1 Navigation/Browsing
2.4.2 Searching by Search Engines
Chapter 3 Related Work on Web Querying
3.1 Issues of Querying the Web
3.2 Modeling the World Wide Web
3.2.1 Object Exchange Model (OEM)
3.2.2 Araneus Data Model (ADM)
3.3 Web Querying Systems
3.3.1 WebSQL, W3QS, WebLog
3.3.2 Wrappers Used in Querying the Web
3.3.3 Summary of Web Querying Systems
3.4 A Review of Database Query Languages
Chapter 4 The Prototype Web Querying System
4.1 Data Model
4.2 A Relational View of the World Wide Web
4.3 System Overview
4.4 Mapping HTML Files to the Database
4.4.1 Extracting Information from HTML Files
4.4.1.1 Extracting Title Information
4.4.1.2 Extracting Hyperlinks and Descriptions
4.4.1.3 Extracting Other Information
4.4.2 Storing the Information in the Database
4.5 Query Facilities
4.5.1 Content Query
4.5.2 Structure Query
4.5.3 Advanced Query
4.6 User Interfaces
4.7 Supplementary Functions
4.8 Summary
Chapter 5 Querying the Web Pages
5.1 Query Methods
5.1.1 Content Query
5.1.2 Structure Query
5.1.3 Advanced Query
5.2 Experimental Results
5.2.1 Content Query
5.2.2 Structure Query
5.2.3 Advanced Query
5.3 Summary
Chapter 6 Discussions and Future Work
6.1 Discussions
6.2 Future Work
References
Vita
LIST OF FIGURES
Figure 2.1 A Sample Hypertext Structure
Figure 2.2 Common Tags in HTML
Figure 2.3 A Sample HTML File
Figure 2.4 A Sample Web Page
Figure 3.1 An OEM Graph
Figure 3.2 A Sample ADM Scheme
Figure 3.3 A Sample Web Page
Figure 3.4 The Architecture of the WebSQL System
Figure 3.5 A Sample Architecture with Mediators and Wrappers
Figure 4.1 An Example of Labeled Directed Graph Model
Figure 4.2 A Sample Web Page for Course Description
Figure 4.3 The System Architecture
Figure 4.4 The Query Interface for Content Query
Figure 4.5 Results of a Content Query
Figure 4.6 The Query Interface for Structure Query
Figure 4.7 Results of a Structure Query
Figure 4.8 The Query Interface for Advanced Query
Figure 4.9 Results of an Advanced Query
Figure 4.10 The User Interface for Accessing Query Facilities
Figure 4.11 The User Interface for Supplementary Functions
LIST OF TABLES
Table 4.1 Relation webpage with 1 Tuple
Table 4.2 Relation webpage-d with 1 Tuple
Table 4.3 Relation links with 13 Tuples
Chapter 1
Introduction
1.1 Motivation
Since its creation in 1989, the World Wide Web has been growing at a phenomenal
rate. As a global information resource residing on the Internet, the World Wide Web
contains a large amount of data relevant to almost all domains of human activity:
education, business, entertainment, art, science, politics, religion, etc. There are currently
tens of millions of documents on the Web, and the number is growing. The explosion of
the World Wide Web, on the one side, is providing more and more information; on the
other side, it is making the Web more and more difficult to use. One fundamental problem is
the difficulty of retrieving specific information of interest to the user from the enormous
number of resources that are available [1]. The most common technology used for
searching the Web is based on browsing the web pages by following links or searching by
sending information retrieval requests to "index servers" [2]. While navigation and
browsing are useful, they can lead to the well-known "lost-in-hyperspace" phenomenon.
On the other hand, limitations also exist in using search engines to seek out the
information from the Web. One major limitation of search engines is that they provide only
keyword search, which means this kind of search cannot effectively use the way
information is structured in Web documents [3]. Therefore, complex queries regarding
hypertext structures are not allowed. For example, there is no way to find the
hypertext links of interest within a given web page by using search engines. Under these
circumstances, a flexible and powerful search system for querying the Web is really
needed.
1.2 Objectives
The main objective of this thesis is to design and implement a prototype system
which can be used to query web pages with database query languages. In order to
overcome the major limitation of search engines mentioned in Section 1.1, our
prototype system should have the ability to support structure queries, which are queries
posed on the hypertext structures of the web pages. However, a new issue arises by
adding this facility to our prototype system, i.e., how to represent the structure of the web
pages. Thus, one of the objectives of this thesis is to build up a data model that is
comprehensive enough to capture some important aspects involved in querying the Web.
Different from other web querying systems, our prototype system exploits an existing
database query language to query the web pages. Accordingly, another objective of this
thesis is to explore the benefit that we can get from using the database query language to
query the web pages.
1.3 Thesis Organization
This thesis consists of six chapters.
Chapter 1 provides an introduction to the thesis.
Chapter 2 gives an overview of Hypertext and the World Wide Web. HyperText
Markup Language and the methods currently used to search the web are also introduced.
Chapter 3 introduces the issues regarding querying the Web and presents several
related works on querying the World Wide Web.
Chapter 4 focuses on the design and implementation of the prototype system. In this
chapter, the data model used to model the structure of the web pages is described. A
system overview is provided. The method of mapping from HTML files to the database
and the query facilities developed in this prototype system are also introduced.
Chapter 5 provides and discusses the experimental results obtained from three
different types of queries, i.e., content query, structure query and advanced query.
Chapter 6 concludes this thesis and offers recommendations for future work.
Chapter 2
An Overview of Hypertext and the World Wide Web
The World Wide Web can be considered as a huge hypertext system on the Internet,
where the hypertext nodes are simply HTML files residing on the file systems of certain
Internet hosts. This chapter provides an overview of the hypertext concept and an
introduction to the World Wide Web.
2.1 An Introduction to Hypertext
Hypertext is text with links. It differs from traditional text in providing quick access
to other related documents from the text that is currently being read. Hypertext structure is the
fundamental structure of the World Wide Web, by which Web documents are organized.
2.1.1 A Brief History of Hypertext
Hypertext has a surprisingly rich history compared to the World Wide Web. The first
system we would now describe as a hypertext system was proposed by Vannevar Bush as
early as 1945. This system, the Memex, was never implemented, but was only described in
theory in Bush's paper [23]. It was described as "... a device in which an individual stores
his books, records, and communications, and which is mechanized so that it may be
consulted with exceeding speed and flexibility." The actual word "hypertext" was coined
by Ted Nelson in 1965. Nelson was an early hypertext pioneer with his Xanadu system,
which he has been developing ever since. Parts of Xanadu do work and have been a
product from the Xanadu Operating Company since 1990. The basic Xanadu idea is that
of a repository for everything that anybody has ever written, giving a truly universal
hypertext system. Nelson views hypertext as a literary medium and he believes that
"everything is deeply intertwingled" and therefore has to be on-line together. A final event
was the extremely rapid growth of hypertext on the Internet in the mid-1990s,
spearheaded by the specification of the World Wide Web by Tim Berners-Lee and
colleagues at CERN (the European Center for Nuclear Physics Research in Geneva,
Switzerland). Detailed information about the history of hypertext can be found in [4].
2.1.2 Hypertext Concepts
The simplest way to define hypertext is to compare it with traditional text. All
traditional text is sequential, i.e., there is a single linear sequence defining the order in
which the text is to be read. Generally, when we read a book, we read page one first, and
then page two, and then page three, and so on. Hypertext, however, is non-sequential, that
is, there is no single order that determines the sequence in which the text is to be read.
Usually, hypertext presents several different options for readers to explore rather than a
single stream of information.
Figure 2.1: A Sample Hypertext Structure
Figure 2.1 illustrates a sample hypertext structure. In this figure, A, B, ..., F represent
units of information, which are called nodes; the dots represent anchors, and the arrows represent
links. Each of the nodes may have pointers to other units, and these pointers are called
links. Links provide the mechanism whereby nodes are connected to one another. The
node from which a link originates is called the reference. Points within the reference
where links are defined are referred to as anchors. The node at which a link ends is called
the referent [4]. As can be seen, the entire hypertext structure forms a network of nodes
and links. Readers move about this network in an activity that is often referred to as
browsing or navigating to emphasize that users must actively determine the order in which
they read the nodes. For example, if a reader is currently reading node A, the next node
the reader can choose is B, D or E. If the reader selects B, then the reader has alternatives
to either read all the text in node B or jump to node C or F, and so on.
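The node/link structure described above can be sketched as a small directed graph. The following is only a toy illustration: the node names follow Figure 2.1's A-F, and the outgoing links of nodes C-F are assumed empty since the figure itself is not reproduced here.

```python
# A hypertext as a directed graph: nodes are units of information,
# links are edges from a reference node (at an anchor) to a referent node.
links = {
    "A": ["B", "D", "E"],   # from node A a reader can jump to B, D or E
    "B": ["C", "F"],        # from node B a reader can jump to C or F
    "C": [], "D": [], "E": [], "F": [],  # assumed leaves for this sketch
}

def reachable(start):
    """All nodes a reader can browse to from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable("A")))  # ['B', 'C', 'D', 'E', 'F']
```

Browsing corresponds to one walk through this graph; the `reachable` traversal computes every node any such walk could visit.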
2.2 The World Wide Web
The most widely used system for hypertext is the World Wide Web. It is also one of
the newest Internet services. The World Wide Web has the ability to combine text, audio,
video, graphics, etc. together. Its hypertext structure provides quick access to other
related Web documents. Now, the World Wide Web is emerging as the newest and
most exciting tool for locating and displaying information on the Internet.
2.2.1 A Brief History of the World Wide Web
The history of the World Wide Web is fairly short. It was developed at CERN in the late
1980s. The purpose of the World Wide Web was to allow anyone at CERN to easily
access and display documents that were stored on a server anywhere on the Internet. By
the end of 1990, the researchers at CERN had a text-mode browser and a graphical
browser for the NeXT computer. During 1991, the World Wide Web was released for
general usage at CERN. Initially, access was restricted to hypertext and UseNet news
articles. As the project advanced, interfaces to other Internet services were added, such as
WAIS, anonymous FTP, Telnet, and Gopher. In 1992, the World Wide Web project was
made public. People began to create their own Web servers to make their information
available to the Internet and to design easy-to-use interfaces to the World Wide Web. By
the end of 1993, browsers had been developed for many different computer systems,
including X Windows, Apple Macintosh and PC/Windows. By the summer of 1994, the
World Wide Web had become one of the most popular ways to access Internet resources.
2.2.2 The World Wide Web Concepts
The World Wide Web uses a client-server architecture for distributed hypertext that
can be accessed over the Internet. Servers run specialized software, called HTTPD
(HyperText Transfer Protocol Daemon), which accepts requests that arrive over the
network, performs a function in response to each request, and then returns the results to
the requester. Servers are also regarded as a collection of Web documents including
hypertext files, images, video clips, sound files, etc., which can be shared over the World
Wide Web. An executing program is called a "client" when it is able to send a request to a
server, await a response, and process that response [4]. A Web browser is such a client
that knows how to interpret and display documents that it finds on the World Wide Web;
examples are Netscape and Microsoft Internet Explorer. All the servers provide their data
to the client software in a standardized format called HTML (HyperText Markup
Language) through a standard communication protocol called HTTP (HyperText Transfer
Protocol). This combination of HTML and HTTP constitutes the hypertext abstract
machine and is the only point at which client and server computers need to agree.
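The request/response cycle between an HTTPD and a client can be sketched with Python's standard library. This is a self-contained illustration only (the server, port and document are made up, not any system described in this thesis): a tiny in-process HTTPD serves one HTML page, and a client fetches it over HTTP.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A minimal HTTPD: answers every GET request with a small HTML document.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><head><title>Hello</title></head><body>Hi</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: send a request, await the response, process it.
url = f"http://127.0.0.1:{server.server_port}/index.html"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode()

print(html)
server.shutdown()
```

The only thing the two sides agree on is exactly what the text above says: the HTTP protocol carrying an HTML document.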
The World Wide Web has a standard way of referencing a document by using a
Uniform Resource Locator (URL), no matter what the document's type is, for example,
text, sound file, etc. A URL is a complete description of a document, containing the
location of the document you want to retrieve. The location could be on your local disk or
on an Internet site halfway around the world. A URL can be set up to be absolute or
relative. An absolute URL contains the complete address of the document that is being
referenced, including the host name, directory path, and file name. The formal syntax of an
absolute URL is:
<Protocol>://<Host>/<Path>/<Filename>#<Location>
where <Protocol> is a protocol that the Web browser can use to retrieve documents,
such as http, ftp, gopher, news, mail, etc.; <Host> is the server name; <Path> is a Unix-
style path for the file; <Filename> is the actual file name and <Location> is a textual
label in the file. For example, an absolute URL breaks down into the parts:
Protocol   Host   Path   File Name   Location
However, if the destination document is on the same Web server as the source document,
a relative URL may be used. A relative URL omits the protocol and host, or even the
path; that is, a relative URL only specifies the subdirectory if available and the file name.
The following example illustrates a relative URL that can be found in our example HTML
file in Figure 2.3.
Subdirectory   File Name
What is worth mentioning is that the equivalent absolute URL can always be constructed
from the URL of the current document and the relative URL in the current document.
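The construction of an absolute URL from the current document's URL and a relative URL can be sketched with Python's standard `urllib.parse` module. The base URL here is hypothetical, chosen only to mirror the document-plus-subdirectory situation described above:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical current document and a relative URL found inside it.
base = "http://www.example.edu/gradstudents/students/xyang/index.html"
relative = "images/Middlesex.html"   # subdirectory + file name only

# The relative URL is resolved against the base document's directory.
absolute = urljoin(base, relative)
print(absolute)
# http://www.example.edu/gradstudents/students/xyang/images/Middlesex.html

# The absolute URL decomposes into the parts named in the syntax above.
parts = urlparse(absolute)
print(parts.scheme)   # Protocol: http
print(parts.netloc)   # Host
print(parts.path)     # Path + Filename
```

Note how `urljoin` drops the base's file name (`index.html`) and keeps its directory, which is exactly the construction the text describes.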
2.3 HyperText Markup Language (HTML)
HTML is the language used when writing a document that is to be displayed through
the World Wide Web. It will be apparent from the example in Figure 2.3 that HTML is a
fairly simple markup language that describes how a document is structured. It is therefore
easy for people to write HTML files for distribution over the World Wide Web, and this
simplicity has been one of the factors in the success and growth of the World Wide Web.
2.3.1 A Brief History of HTML
HTML was originally developed by Tim Berners-Lee while at CERN, and
popularized by the Mosaic browser developed at the National Center for Supercomputing
Applications (NCSA). During the course of the 1990s it has blossomed with the explosive
growth of the World Wide Web. In 1994, HTML 2.0 was developed to codify common
practice. HTML+ (1993) and HTML 3.0 (1995) proposed much richer versions of
HTML. In 1996, the efforts of the World Wide Web Consortium's HTML Working Group
to codify common practice resulted in HTML 3.2. Now, HTML 4.0 is the latest version
with more powerful and mature features.
2.3.2 Common HTML Tags
HTML is an application of SGML (Standard Generalized Markup Language) [4]. It
defines a collection of tags that can be used to publish on-line documents with headings,
text, tables, lists, photos, etc. and to retrieve on-line information via hypertext links.
These tags also provide a means to enable images, sound and even animation to be
embedded in Web documents, and to design forms for conducting transactions with remote
services, for use in searching for information, making reservations, ordering products, etc.
Figure 2.2 contains some of the commonly used tags in an HTML document.
The most important feature of HTML is its ability to insert hypertext links into an
HTML document so that other HTML documents can be linked by these links. Hypertext
links in an HTML file are pointers from keywords appearing in the document to a
destination. The destination could be another HTML document or a resource such as an
external image, a video clip, or a sound file. HTML supports hypertext links through an
anchor tag <A> in the form of:
<A HREF="DestURL">AnchorText</A>
where HREF stands for Hypertext REFerence; DestURL is the URL of the
destination document and AnchorText is the text to appear as an anchor when the
document in which this hypertext link is defined is displayed by the Web browser.
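As a rough sketch of how such (DestURL, AnchorText) pairs can be pulled out of an HTML file, the following uses Python's standard `HTMLParser`. This is an illustration only, not the parser implemented in this thesis; the input HTML fragment is made up to resemble Figure 2.3:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (DestURL, AnchorText) pairs from <A HREF=...> tags."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":       # HTMLParser lowercases tag and attribute names
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<A HREF="personal.html">Personal Information</A> <A HREF="courses.html">Courses</A>'
p = AnchorExtractor()
p.feed(html)
print(p.links)
# [('personal.html', 'Personal Information'), ('courses.html', 'Courses')]
```

A parser of this general shape is what makes the hyperlink extraction of Chapter 4 possible, although the thesis's own implementation may differ.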
2.3.3 Examples
Basically, an HTML document consists of two parts: the head and the body. The head
contains meta-information about the document. It is specified using the <title> ...
</title> or <meta ...> tags. The body contains the displayable content. Users can navigate
over the various documents by activating the hypertext links of interest. The display that is
the result of viewing an HTML document using a browser is called a page. Figures 2.3 and
2.4 show a sample HTML file and the corresponding file displayed in a Web browser.
Page Markup:
  <html>...</html>      HTML document definition
  <head>...</head>      Head section definition
  <body>...</body>      Body section definition
  <title>...</title>    Document title
Hypertext Links:
  <a>...</a>            Hyperlink (anchor) definition
Inline Images:
  <img>                 Image
Form Elements:
  <form>...</form>      Fill-out forms
  <input>               Input element
Table Elements:
  <table>...</table>    Table definition
  <tr>...</tr>          Table row
  <td>...</td>          Standard table data cell
  <th>...</th>          Table header cell
List Elements:
  <ul>...</ul>          Unordered list
  <li>                  List item
  <menu>...</menu>      Menu list
Structural Markup:
  <p>                   Paragraph break
  <br>                  Line break
  <h1>...</h6>          Level 1-6 headings
Style Markup:
  <i>...</i>            Italics style
  <strong>...</strong>  Strong emphasis style
Figure 2.2: Common Tags in HTML [4]
<HTML>
<HEAD>
<TITLE>Xiaoyu Yang's Home Page</TITLE>
</HEAD>
<BODY BGCOLOR="...">
<CENTER><IMG SRC="welcomeee.gif" ALIGN=middle><BR><BR><BR>
<FONT SIZE=+6 ALIGN=middle><I><BLINK>Welcome to Xiaoyu Yang's homepage</BLINK></I></FONT><BR><BR>
<IMG SRC="grapevine.gif" ALIGN=middle><BR><BR><BR>
<TABLE BORDER="0" CELLPADDING="10">
<TR VALIGN="middle"><TD>
<A HREF="personal.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Personal Information</STRONG></A><BR>
<A HREF="research.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Research Work</STRONG></A><BR>
<A HREF="/faculty/sylvia/"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>My Supervisor</STRONG></A><BR>
<A HREF="courses.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Courses</STRONG></A><BR>
<A HREF="ta.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>TA</STRONG></A><BR>
<A HREF="interests.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Interesting Links</STRONG></A>
</TD></TR></TABLE><BR>
<IMG SRC="grapevine.gif" ALIGN=middle><BR><BR><BR><P>
<A HREF="http://www.csd.uwo.ca/">Department of Computer Science</A> |
<A HREF="images/Middlesex.html">Middlesex College</A><BR>
<A HREF="http://www.uwo.ca/">The University of Western Ontario</A> | <A HREF="http://www.city.london.on.ca/">London</A> | <A HREF="http://www.gov.on.ca/">Ontario</A> | <A HREF="http://canada.gc.ca/">Canada</A><BR><BR><BR>
This page has been accessed
<IMG SRC="http://.../cgi-bin/nph-count?width=6&link=http://www.csd.uwo.ca/gradstudents/students/xyang/index.html"> times since July 1, 1998.</P>
</CENTER><HR><BR><BR>
<IMG SRC="painting.gif"> <FONT SIZE=1><B>Last modified by
<A HREF="mailto:xyang@csd.uwo.ca">Xiaoyu Yang</A> on February 18, 1998.</B></FONT>
</BODY>
</HTML>
Figure 2.3: A Sample HTML File
Welcome to Xiaoyu Yang's Homepage
Personal Information   Research Work   My Supervisor   Courses   TA   Interesting Links
Figure 2.4: A Sample Web Page
2.4 Searching the World Wide Web
One of the real advantages of the World Wide Web system is that ordinary users can
create web pages that users anywhere on the Internet can display. This feature allows
ordinary users to publish information that can be used by the entire world and also results
in the rapid growth of the amount of information available on the World Wide Web. There
are software programs, graphics, magazine articles, job postings, government reports,
weather maps, and thousands and thousands of documents, and so on. Therefore, it is not
always easy for users to find what they want, or even to know how to find it. Currently,
there are two main methods that are used to search for information of interest:
navigation/browsing and searching by search engines.
2.4.1 Navigation/Browsing
This is an excellent method of locating information which a user may not have
considered available on the World Wide Web. It involves starting somewhere and just
following the links. This is the simplest way to find information, but it is not a reliable method
to find a particular piece of information on the World Wide Web. As mentioned in Chapter
1, readers often experience the "lost-in-hyperspace" phenomenon when navigating the
World Wide Web.
A solution to the navigation problem is to provide users with classified directories
[14], which can guide users to useful resources on a particular subject or of a particular
type. An excellent example of this is Yahoo, which allows the users to search through its
hierarchy. Since documents of a similar subject/type may be grouped together, this method
obviously can narrow the scope of a search. Navigation by classified directories, however,
is sometimes limited by the documents available, and there is still a risk that users may
become disoriented or have trouble finding the information they need.
2.4.2 Searching by Search Engines
The need for retrieving information from the World Wide Web has led to the
development of a number of search engines. They search the Web according to the
keywords or phrases specified by the user and return the results which are related to the
keywords or phrases. Typically, search engines are composed of a resource locator (also
known as a robot) and a search interface. Searches are based on an index database which
stores the information of the web pages. The resource locator is run periodically to gather
information from the Web and to create and update the index database. The search interface
takes a user query and passes the request to the Web server, which performs a retrieval from
the index database and returns the results. The results appear in hypertext and can
immediately be selected to link to the required documents.
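The index-database idea can be sketched as a toy inverted index. This is an illustration of the general mechanism only, not any particular engine's implementation; the pages and URLs are made up:

```python
from collections import defaultdict

# Hypothetical pages gathered by a resource locator (robot).
pages = {
    "http://a.example/": "database query languages for the web",
    "http://b.example/": "hypertext structure of web pages",
    "http://c.example/": "relational database systems",
}

# The "index database": maps each keyword to the URLs of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(keyword):
    """The search interface: look the keyword up in the index database."""
    return sorted(index.get(keyword, set()))

print(search("database"))   # pages a and c
print(search("hypertext"))  # page b
```

Note that the index knows only which words occur where; it records nothing about how pages link to one another, which is exactly the limitation discussed next.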
Many external search engines with different capabilities exist on different servers.
Some of the search engines are: AltaVista, InfoSeek, Lycos, etc. Although there are a
large number of search engines available on the Web, they are all used in exactly the same
simple way: type in some text, and get back a hypertext answer which points to things that
were found by the search. Searching the World Wide Web by search engines is easy and
useful. However, as Martijn Koster says in his article [6], robots "will become less
effective and more problematic as the Web grows". The major limitation of search engines
is that they support only keyword search, which means it is impossible to pose queries on
hypertext structures. For example, assume we know the URL of the home page of the
Computer Science Department at the University of Western Ontario and we would like to
be able to restrict the search to only pages directly or indirectly reachable from this page.
With currently available search tools, this kind of query is not possible. There are also
some other drawbacks with search engines, such as: they cannot be adapted easily to the
requirements of a specific user, their query language is poor, and they often return too
many answers, badly ordered and mainly irrelevant. These limitations stimulate the need
for new and powerful search tools.
Chapter 3
Related Work on Web Querying
The World Wide Web is a distributed, ever growing, global information resource. The
rapid growth of the Web makes the wealth of information become more and more difficult
to mine. Currently, there are only two ways to search the information available on the
Web: navigation/browsing or searching by search engines. These two methods, however,
have important limitations, as stated in Chapter 1. Thus, the situation here is that we have an
invaluable information resource, but cannot use it effectively. This compelling need for
querying the Web in a flexible and powerful way has led to the development of a number
of new web querying languages and systems. In this chapter, we focus on the related work
in the area of querying the Web. First, the important issues and the difficulties existing in
querying the Web are presented. Then, data models used to describe the Web are
introduced. Finally, several on-going Web querying projects are discussed.
3.1 Issues of Querying the Web
One important and fundamental issue of querying the Web is the design of a data
model, which should be comprehensive enough to capture most of the important aspects
involved in querying the Web. On the Web, data consists of files in a particular format,
HTML, with some structuring primitives such as tags and anchors. Generally, the
structure of HTML files is irregular, implicit, partial and frequently changing. These files
do have some structure but it is too irregular to be easily modeled by using a relational or
an object-oriented approach [19], especially when the structure is nested or cyclic.
Accordingly, these kinds of files are called semi-structured files [18]. How to model semi-
structured files is thus an essential issue in querying the Web.
Another important issue in the area of querying the Web is extracting information
from the Web. The irregularity of the structure of web pages results in the difficulty of
extracting the information. This problem has been studied and partially solved for SGML
documents [7, 8]. The idea used here is to map the underlying grammar of the document
to an appropriate database schema. Thus, when the document is parsed by using this
grammar, corresponding objects would be created in the database. However, when dealing
with HTML files in the same way, grammars show important limitations. First, the
structure of HTML files is not always completely defined. Second, the structure can be
irregular, and HTML files often contain errors, in the sense that they do not fully comply
with HTML grammar rules; missing tags are a common example of these errors.
Moreover, information gathering on the Web lays its emphasis on navigation via
hyperlinks that relate documents to one another. Under these circumstances, the design of
a parser or other tools used to extract information from the Web becomes more difficult.
A query language is also a very important issue of web querying and an absolutely
necessary component of web querying systems. Basically, this kind of language should
have the power of traditional query languages and also support richer data types, allowing
recursive queries, etc. Recently, query languages for the Web have attracted a lot of
attention [10]. Several SQL-like query languages have been designed for the Web,
although these languages still need a sound theoretical foundation. Another trend of web
querying is to exploit existing database query languages. These languages are based on
well-defined theory and are fairly mature, and thus should be able to provide more powerful
query facilities.
3.2 Modeling the World Wide Web
As mentioned above, web modeling is a very important issue of web querying.
Several data models have been proposed to describe the Web. In this section, two
different data models, namely, OEM (Object Exchange Model) [9, 20, 22] and ADM
(Araneus Data Model) [11], are introduced.
3.2.1 Object Exchange Model (OEM)
The Object Exchange Model (OEM) is proposed to represent semi-structured data. It is a
simple, self-describing model with object nesting and identity. Data represented in OEM
can be thought of as a graph, with objects as the vertices and labels on the edges. Entities
are represented by objects. Each object has a unique object identifier (oid), a label and a
value. The label is a string denoting the "meaning" of the object. The value can be from
one of the disjoint basic atomic types, such as integer, real, string, gif, html, audio, etc.
The value can also be a complex object, which is a set of sub-objects. An object is thus a 3-
tuple: <oid, label, value>. A database D = <O, N> is a set O of objects, a subset N of
which are named objects. The intuition is that named objects provide "entry points" into
the database from which sub-objects can be requested and explored. An OEM database
can also be easily viewed as a relational database with a binary relation VAL(oid, value)
to specify the values of atomic objects and a ternary relation
MEMBER(oid1, label, oid2) to specify the values of complex objects. As can be seen, the
design of OEM is intended to make it a simple, flexible and powerful data model for
describing semi-structured data.
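As a concrete illustration, the flattening of a small OEM database into the VAL and MEMBER relations can be sketched as follows. The object ids &19 and &17 and the labels "category" and "gourmet" follow the figure discussed below; the "name" sub-object and all other concrete values are invented for the example.

```python
# Flattening a tiny OEM database into the relational view: VAL(oid,
# value) for atomic objects, MEMBER(oid1, label, oid2) for complex
# objects. Ids &19 and &17 follow Figure 3.1; the "name" sub-object
# and the concrete values are invented.
objects = {
    "&19": ("restaurant", [("category", "&17"), ("name", "&18")]),
    "&17": ("category", "gourmet"),
    "&18": ("name", "Chef Chu"),
}

VAL = []     # (oid, value) tuples for atomic objects
MEMBER = []  # (oid1, label, oid2) tuples for complex objects

for oid, (label, value) in objects.items():
    if isinstance(value, list):            # complex object: a set of sub-objects
        for sub_label, sub_oid in value:
            MEMBER.append((oid, sub_label, sub_oid))
    else:                                  # atomic object
        VAL.append((oid, value))
```

The same dictionary of 3-tuples thus yields both relations, which is what makes the relational view of an OEM database straightforward.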
There are some minor variations on the OEM graph model that are used in querying
the Web [21]. These data models use a labeled graph or a graph schema, with nodes
representing web pages and labels representing hyperlinks.
Figure 3.1 illustrates an OEM graph. In this figure, &19 is the identifier of an object
whose complex value is a set containing, among others, the pair ("category", &17). &17 is
the identifier of an atomic object whose value is "gourmet".
Figure 3.1: An OEM Graph [10]
3.2.2 Araneus Data Model (ADM)
The Araneus Data Model (ADM) is a page-oriented data model for the Web. This
means the main construct of the model is that of a page scheme. Each page scheme
describes the structure of a set of homogeneous pages. Each web page is thus considered
as an object with an identifier (the URL) and a set of attributes, one for each relevant
piece of information in the page. The attributes used to describe a web page can be either
simple, like text, image, or a link to another page, or complex. Complex attributes are
essentially lists of items, possibly nested. From this perspective, an ADM scheme can
be seen as a collection of page schemes, connected using links. Figure 3.2 shows an ADM
scheme with one of the example page schemes corresponding to the web page shown in
Figure 3.3.
Figure 3.2: A Sample ADM Scheme [11]
Leonardo Da Vinci
Figure 3.3: A Sample Web Page [11]
Figure 3.3 shows a sample web page containing the publications by an author. Its
corresponding page scheme AuthorPage can be found in Figure 3.2, which shows an
ADM scheme for the DB&LP Bibliography home page. For each author in the DB&LP
Bibliography home page there is a similar web page, and all of these web pages share the
same structure. Therefore, they can be described by the same page scheme. The
AuthorPage scheme has two attributes: Name and WorkList, the latter being a list of
publications, i.e., a set of nested tuples. For each paper listed in the WorkList, there is a
page scheme, ConferencePage or JournalPage, and these two page schemes in turn contain
their own attributes, and so on. As can be seen, an ADM scheme deals with structured web
pages in which data are organized according to precise structures and web pages present
strong regularities. Therefore, it is suitable for building database abstractions of large and
fairly well-structured web sites.
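An ADM page instance for the AuthorPage scheme can be sketched as follows. The attribute names Name and WorkList come from the text above; the field names inside the nested tuples and all concrete values are invented for the example.

```python
# An ADM-style instance of the AuthorPage scheme of Figure 3.2. Name
# is a simple attribute; WorkList is a complex attribute, a possibly
# nested list of tuples whose field names here are illustrative only.
author_page = {
    "url": "http://example.org/authors/da-vinci.html",  # page identifier
    "Name": "Leonardo Da Vinci",                        # simple attribute
    "WorkList": [                                       # complex attribute
        {"Title": "Paper A", "RefPage": "http://example.org/conf/a.html"},
        {"Title": "Paper B", "RefPage": "http://example.org/journal/b.html"},
    ],
}
```

Every author page of the site would instantiate this same scheme, differing only in its URL and attribute values.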
3.3 Web Querying Systems
Several web querying languages and systems have been proposed recently. Most of
the efforts are concerned with issues such as the development of data models and query
languages for the Web, defining formal semantics for the proposed languages, and
implementation issues. In some of these systems, such as WebSQL [3, 17], W3QL [13]
and WebLog [1], there is a very simple notion of scheme, and web pages are considered
within a single type, i.e., as nodes in a graph, with at most a fixed set of attributes. In other
words, these kinds of systems use OEM or variants of OEM to model the Web. There is
also a kind of web querying system which intends to exploit the regular structure
presented in web pages and therefore uses a more complex data model. For example, the
Araneus Project [11] uses a page-oriented data model to describe the Web. Another trend
in retrieving information from the Web is to focus on the generation of wrappers, which can
facilitate database-like querying of semi-structured data retrieved directly from Web
servers. This kind of web querying system is called a mediator-based system. Related
work in wrapper generation and mediator-based systems can be found in [12].
In this section, we first introduce WebSQL, a web querying system based on a simple
graph data model. The other two systems similar to WebSQL, i.e., W3QS and WebLog,
are also introduced, but only their differences are emphasized. The mediator-based systems
present a different system architecture because of the existence of the wrapper and
mediator. We illustrate in this section the architecture of the wrapper and mediator and
also introduce the generation of a wrapper.
3.3.1 WebSQL, W3QS, WebLog
WebSQL was developed at the University of Toronto. Its query language is an SQL-like
language for querying Web sources by exploiting the structure and topology of the
document networks. The distinct feature of WebSQL is that it provides a formal semantics
and emphasizes the distinction between local and remote documents. Figure 3.4 provides a
system overview of WebSQL.
Figure 3.4: The Architecture of the WebSQL System
In the WebSQL system, the User Interface accepts the user query and passes the
query to the WebSQL Compiler, where the user query is parsed and translated into a
custom-designed object language. When the Virtual Machine receives the object code
generated for the query, it executes the object code and sends the requests to the
Query Engine, which finally performs the query, extracts the information of interest from
the Web and returns the results. After the Query Engine passes the
results to the Virtual Machine, the Virtual Machine turns the results into HTML form
and then displays them to the user [3].
In WebSQL, the hypertext structure is represented by a graph data model [17]. This
model can then be viewed as a relational model composed of two virtual relations: one for
web documents and the other for anchors in web documents. The relational abstraction of
the Web allows one to use an SQL-like query language to pose queries on both content
and hypertext structure. Although the WebSQL query language is designed as a subset of
SQL, it is only a simulation of SQL. Therefore, it cannot be as powerful as SQL, and a lot of
work remains to be done in designing the query language, such as query optimization,
which has actually been well studied in existing database systems.
W3QS, developed at the Technion, Israel, is a system for SQL-like querying of the
Web. The system architecture is slightly different from that of WebSQL. The feature of
W3QS is that it interfaces to user programs and UNIX services for analyzing and filtering
semi-structured information from Web servers. It allows the use of Perl regular
expressions and calls to UNIX programs from the "where" clause of an SQL-like query, and
even calls to Web browsers. Moreover, the language has been designed to be highly
extensible, and tools for managing Web forms encountered during navigation are
presented [13]. Again, advanced database techniques are not exploited in W3QS either.
Different from the two web querying systems mentioned above, WebLog, developed
at Concordia University, Montreal, and the University of North Carolina, emphasizes
manipulating the internal structure of Web documents. Its query language is based on
Datalog-like recursive rules [1].
3.3.2 Wrappers Used in Querying the Web
In a mediator-based system, wrappers are the essential components built around
individual information sources. They are used to accept queries from the mediator,
translate each query into the appropriate query for the individual source, and return the
results to the mediator. They make the Web sources look like databases that can be
queried through the mediator's query language, i.e., a database query language or a
custom-designed query language. Figure 3.5 shows an example of mediator architecture.
In this figure, the sources represent several related Web sources in a particular domain of
interest. All of them should conform to the same format. The mediator here is used to
integrate information from multiple Web sources, i.e., Sources 1, 2, 3, and it is built for a
particular domain of interest.
Figure 3.5: A Sample Architecture of Mediator and Wrappers
When a wrapper is generated for a new Web source, the following steps are involved.
First, the web pages need to be structured, i.e., sections and sub-sections of
interest on a page must be identified. Then, a parser should be built for the source pages to
extract the sections of interest. Finally, communication capabilities between the wrapper,
mediator and Web sources should be added so that wrappers can fetch the pages containing the
requested information from the Web source and return them to the mediator. The key idea
of generating a wrapper is to exploit formatting information in web pages to hypothesize
the underlying structure of a page. Once the correct structure is obtained, a wrapper for
the source can be generated without much effort or time and information of interest can be
obtained. When web pages are loosely structured, such as personal home pages, building a
wrapper becomes a difficult task [12].
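The three steps above can be sketched as a minimal wrapper skeleton. The fetch and parse callables stand in for the communication layer (step 3) and the source-specific parser (steps 1 and 2); the field/value query interface and the dictionary record format are illustrative assumptions, not the interface of any particular mediator system.

```python
# A minimal wrapper skeleton: accept a request from the mediator,
# fetch the source page, parse out the structured records, and return
# the matching ones. All names and formats here are illustrative.
class Wrapper:
    def __init__(self, fetch, parse):
        self.fetch = fetch    # url -> raw page text (communication layer)
        self.parse = parse    # raw page text -> list of structured records

    def query(self, url, field, value):
        """Answer a mediator request by fetching, parsing and filtering."""
        records = self.parse(self.fetch(url))
        return [r for r in records if r.get(field) == value]
```

A mediator would hold one such wrapper per source and merge the answers it gets back.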
3.3.3 Summary of Web Querying Systems
All the current systems used for querying the Web provide a query language. These
languages are either SQL-like or Datalog-like and allow for expressing both structure-
specifying queries, based on the organization of the hypertext, and content queries, mainly
based on information retrieval techniques, e.g., search engines. In mediator-based systems,
a database query language can be used as the mediator's query language, as can other SQL-
like query languages. This kind of system is, however, suitable only in cases where web
pages present strong structure. For those web pages that are loosely structured, most of
the systems exploit an SQL-like or Datalog-like query language to query them. None of
them exploits a database query language, and thus they cannot benefit from advanced
database technologies. Having this in mind, we are trying to develop a system that uses an
existing database query language to query the Web and to show the power of a database
query language in querying the Web.
3.4 A Review of Database Query Languages
Database systems have been in existence for more than 30 years and have been
successfully used in a wide range of application areas, such as business, industry,
scientific research, engineering, and most recently the World Wide Web.
One of the major purposes of database systems is to store data while providing ad hoc
query facilities to query these data. To accomplish this purpose, Dr. E. F. Codd proposed
the relational data model, based on strong mathematical foundations, in 1970. During the
1970s, research and development work on relational database systems was carried out and
several prototypes were developed. The SQL-based database systems of the 1980s
provided, for the first time, a single language to span the whole range of applications, with
support for multiple views of data and independence from physical data structures. Since
that time, relational databases have grown from strength to strength. Because of the
success of these systems, relational databases have become ubiquitous and SQL has
become a world-standard database language. It was, however, soon discovered that
traditional database query languages have limited expressive power. For
example, they support only limited data types and cannot compute arbitrary transitive
closures. Some of these challenges faced by relational systems derive from the need to
store and retrieve new types of very large objects with complex state and behavior, such
as multimedia objects and data from the Web. In the mid-1980s, a new type of database
system emerged to meet these challenges: object-oriented database systems.
These systems address many of the weaknesses of relational databases by providing
object-oriented features and supporting richer data types. Recursive traversal of object sets
is also possible in these systems. Meanwhile, extensions to the relational model have been
defined more recently by introducing the concepts of the Abstract Data Type (ADT) and nested
relations to the relational model to improve its object-orientation, leading to so-called
object-relational database systems. Relational query languages have been extended
and new features have been added. For example, SQL3 (Structured Query Language) is
an effort to turn ANSI SQL-92 into an object-relational query language. Compared with
SQL-92, the new features of SQL3 include not only further developments and extensions
of existing concepts, but also some completely new concepts. One extension in the query
facility is to extend query possibilities, for example, by using recursion. New features of
SQL3 include support for ADTs, nested relations, etc. One example of such systems is
DB2, which is a substantial advance over traditional relational systems. The new features
of DB2 include major innovations in query optimization, recursive union, active databases
(triggers), and stored procedures [24]. It integrates object-oriented ideas with the SQL
language to produce an object-relational database management system and provides new
functions and data types, including data types for storing large objects. More importantly,
it provides a means for users to define additional functions and data types of their own to
meet the specialized needs of their applications.
Database technology has by now become fairly mature. It is well known
that database systems offer efficient and reliable technology for querying structured data. It is,
however, a new and challenging issue to apply database techniques to the poorly structured
World Wide Web. Once web pages can be described by a database schema, the
management of web data should be able to profit highly from database technology. Our
attempt to build the prototype system on top of a database system and then use the
database query language to query the web pages is inspired by this idea. The database
management system we use for our web querying system is DB2 Version 2 for common
servers. As can be seen in the later chapters, we benefit a lot from the powerful query
capabilities of DB2, and by using an existing database query language we are freed from
the design and implementation of a new query language for the Web.
Chapter 4
The Prototype Web Querying System
Our prototype system is designed and implemented for querying the Web by using
database query languages. The system provides three kinds of queries: content queries,
which are queries posed on the content of web pages; structure queries, which are queries
posed on the underlying hypertext structure of web pages; and advanced queries, which
are arbitrary queries posed on the virtual relations of web pages. The structure of web
pages is modeled in our prototype system by a simple labeled directed graph. This chapter
describes the low-level design of the prototype system, including the underlying data
model, the virtual relations, the parser used to map HTML files to the database, and the
search facilities developed in the prototype.
4.1 Data Model
The World Wide Web is a large, heterogeneous, distributed collection of documents
connected by hypertext links. At the highest level of abstraction, it can be viewed as a
graph whose nodes are web pages that are identified by URLs and have some arbitrary
attributes. In our prototype system, the World Wide Web is modeled as such a simple
labeled directed graph. This model can be viewed as a variant of OEM.
In our data model, each web page is represented as a node in the graph. Each node
has a unique identifier, a label and a value. The identifier of each node is the URL of the
corresponding web page. The label is a string, namely the AnchorText [see Section 2.3.2]
that describes the hyperlink. The value is a set of attributes describing the node.
Labels are also attached to edges. If a node contains hyperlinks, it must have outgoing
edges to other nodes. If a node has no hyperlinks, it does not have any outgoing edge
and therefore is a leaf node. For any two nodes x, y, there can be at most two edges, with
different directions, between x and y. A node can have at most one edge that points to the
node itself. As can be seen, the data model is a simplified one, since it allows for only one
link in any one direction between two pages. Figure 4.1 illustrates the labeled directed
graph modeling part of the web pages of the Department of Computer Science at the
University of Western Ontario.
Figure 4.1: An Example of the Labeled Directed Graph Model
Figure 4.1 illustrates the structure of 9 web pages found in the Department of
Computer Science web site on August 2, 1998. These web pages are represented by
nodes, which are circles filled with light gray. &1, &2, ..., &9 represent the URLs of the
corresponding web pages, i.e., nodes 1 to 9. Labels are descriptions of links between
nodes. A possible edge for a node could be an incoming edge, an outgoing edge or a loop.
An incoming edge of a node is a link that points to this node. For example, node 2 has an
incoming edge from node 1 labeled "About the Department". An outgoing edge is a link
pointing to another node. For example, node 1 has an outgoing edge to node 2 labeled
"About the Department". A node has a loop when it has a link that points to the node itself.
For example, node 9 has a loop labeled "Return to Top".
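A minimal sketch of this data model, using the two edges just described from Figure 4.1 (node attribute values are omitted here; they are the subject of the next section):

```python
# A sketch of the labeled directed graph model. Nodes are web pages
# identified by URLs (&1 ... &9 in Figure 4.1); each directed edge
# carries the anchor-text label of the hyperlink. Since the model
# allows at most one edge per direction between two nodes, a dict
# keyed by (source, destination) suffices.
edges = {}

def add_link(src, dst, label):
    """Record a labeled edge; src == dst represents a loop."""
    edges[(src, dst)] = label

# Two edges taken from Figure 4.1:
add_link("&1", "&2", "About the Department")   # outgoing edge of node 1
add_link("&9", "&9", "Return to Top")          # loop on node 9

def is_leaf(url):
    """A node with no hyperlinks has no outgoing edges and is a leaf."""
    return not any(src == url for (src, _dst) in edges)
```

Incoming edges of a node are simply the dictionary entries whose destination is that node's URL.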
4.2 A Relational View of the World Wide Web
After the structure of web pages is represented by a labeled directed graph model, we
can easily view the Web as a relational database. The only difficulty here is to define the
value, that is, the set of attributes describing the nodes. The set of attributes could be very
complex, reflecting the internal structure of a web page. Each attribute could be related to
a small piece of information presented in the web page. For example, a number of web
pages that provide information for Course Descriptions can be found on the departmental
web site; each of these pages presents the same structure as that shown in Figure 4.2.
Course Description
Computer Science 411a/b  Databases II
A selection from the following topics: dependency theory; object-oriented databases; distributed databases and related algorithms; database hardware; information retrieval.
Antirequisite: The former Computer Science 410a
Prerequisite: Computer Science 319a/b.
3 lecture hours, half course.
Figure 4.2: A Sample Web Page for Course Description
For these web pages, we can use a set of attributes, such as (Description, Antirequisite,
Prerequisite, Load), to define them. In this way, web pages can be described precisely and
there is less chance of losing information in web pages when they are mapped to the
database. Most web pages, however, are loosely structured; personal home pages, for
example, almost all have their own unique styles. Defining attributes as we do in
the above example is almost impossible for them. As a tradeoff, we take a minimalist approach to
determining the attributes, which captures only common features of web pages.
Generally, in an HTML file corresponding to a web page, there is always a pair of
title tags, i.e., <title> Title </title>, which provides the title information of that page. This
information can be used as an attribute describing the web page. Also, there is some other
general information that can be found in a web page, such as the name of the author, the
number of links contained in the web page, the last modified date, and the size of the
corresponding HTML file. Hence, a set of attributes used to describe an arbitrary web
page can be obtained, e.g., (title, author, linkno, last_modified, size). Once we have
assigned a value to a node, i.e., a web page, we can associate the node with a tuple in a
webpage relation:
webpage (url, title, author, linkno, last_modified, size)
Here, url represents the URL of the web page and is thus the primary key. Except for
linkno and size, which are integers, all other attributes are character strings. Except for the
primary key, all other attributes may be null. As can be seen, this relation provides
general information for web pages. It gives a web page a highly abstract description.
However, when web pages are mapped into this relation, some of the information may be
lost. As a result, content queries cannot be executed precisely. In order to overcome this
limitation, we need a supplementary relation for the node. This relation is defined as
follows in our prototype:
webpage_d (url, content)
Each node in the data model is related to a tuple in this relation, and the primary key url is
the URL of the node. The attribute content is associated with the whole HTML file of a node.
Here, we take advantage of the new data type CLOB (Character Large Object)
provided in DB2. This data type can contain up to two gigabytes (2^31 - 1 bytes) of single-
byte character data, so it has the ability to hold a whole HTML file. By using this relation,
information in a web page will not be lost when the web page is mapped to the database.
Hence, content queries can be executed precisely. The reason why we use two relations to
describe a web page is that relation webpage_d is used for content queries only and
relation webpage is used for constructing query results.
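The two-relation design can be illustrated with the following sketch, which creates the relations and runs a content lookup. SQLite is used here purely as a stand-in for DB2 (its TEXT column plays the role of DB2's CLOB), and the sample tuple is invented.

```python
import sqlite3

# The two relations of the relational view, using SQLite in place of
# DB2. A content query joins webpage_d (for the LIKE predicate on the
# stored HTML) with webpage (for the result attributes).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table webpage (
        url           text primary key,  -- URL of the page
        title         text,
        author        text,
        linkno        integer,           -- number of hyperlinks
        last_modified text,
        size          integer            -- size of the HTML file
    );
    create table webpage_d (
        url     text primary key,
        content text                     -- whole HTML file (CLOB in DB2)
    );
""")
conn.execute("insert into webpage values (?,?,?,?,?,?)",
             ("http://example.org/", "example page", None, 3, "1998-08-02", 1024))
conn.execute("insert into webpage_d values (?,?)",
             ("http://example.org/", "<html><title>example page</title></html>"))
rows = conn.execute(
    "select w.url, w.title from webpage w, webpage_d d "
    "where w.url = d.url and d.content like '%example%'").fetchall()
```

Note that the LIKE predicate touches only webpage_d, while the displayed columns come from webpage, mirroring the division of labor described above.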
One motivation for developing new web querying systems is that current search
engines cannot use the structure of Web documents. To address this problem, new web
querying systems should provide the function of querying the hypertext structure. To
implement this function, there should be a relation in which the relationship between two
nodes is represented. In our data model, we capture the information present in a hyperlink
as a tuple in a links relation:
links (url_a, url_b, description)
where url_a and url_b are the URLs of the origin and destination of the link, i.e., url_a and
url_b correspond to nodes A and B with the relationship A → B. All these
attributes are character strings. The primary key for this relation is the combination of
url_a and url_b. Only description may be null.
Based on the labeled directed graph model, the three relations introduced above
model the Web as a relational database. They capture both the information and the
hypertext structure presented in web pages. This relational abstraction of the Web allows
us to use a database query language to pose queries on both content and structure.
4.3 System Overview
Conceptually, our prototype web querying system has the following components:
Interfaces, that accept the queries, present the results and guide the users to other
functions provided by the system;
A parser, that extracts the information from the HTML files and creates tuples in
the database for each web page in the database;
Query facilities, that invoke the appropriate search processes to provide content
query, structure query and advanced query;
Supplementary functions, that provide facilities for the users to maintain the web
pages stored in the database.
Figure 4.3: The System Architecture
Users interact with this prototype system via an interface, in which they can choose to
pose a query or perform other operations, such as adding a web page to the database, deleting a
web page from the database, or displaying the information of a locally stored HTML file.
When users decide to pose a query, there are three kinds of queries with three different
interfaces provided to the users, namely, content query, structure query and advanced
query. After users specify a query, a corresponding query process is invoked and the query
is performed on the corresponding relations. Finally, the query results are displayed to the
users. Currently, our test web pages are locally stored and parsed by a parser so that
information from a web page can be stored in the three relations introduced in the
previous section.
4.4 Mapping HTML Files to the Database
The key component for mapping an HTML file to the database is the parser. It
provides a means of extracting the information of interest from HTML files and storing
it in a database, e.g., in the three relations mentioned above. In this section we
introduce in detail how the parser maps HTML files to the database.
4.4.1 Extracting Information from HTML Files
In our prototype, the World Wide Web is modeled as a labeled directed graph, which
can be represented by three relations in a relational database. These three relations are:
webpage (url, title, author, linkno, last_modified, size); webpage_d (url, content); links
(url_a, url_b, description). Hence, the information we need to extract from an HTML file
is related to the attributes in these three relations. Assume we are parsing a web page
whose corresponding HTML file is called sample.html. What we would like to extract
from sample.html are the URL of this web page, the title, the author, the number of hyperlinks
contained in this file, the last modified date, the size of the file, all the hyperlinks that can be
found in this file, and their corresponding descriptions. The algorithms used in extracting the
information from an HTML file are described in the next several sections.
4.4.1.1 Extracting Title Information
The title element is common to all HTML files. The HTML DTD (Document Type
Definition) specifies that a <title> container be included in a file and that there should
be only one <title> container in any file [4], although there may be exceptions in some
HTML files. Generally, the title should identify the contents of the document in a global
context, and the title text should be included between <title> and </title> without any other
markup, such as anchors, paragraph tags or highlighting. The syntax introduced above
makes it easy for the parser to extract the title information from an HTML file. The
algorithm is as simple as looking for the string starting with <title> and ending with
</title> and then extracting the string between <title> and </title>.
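The algorithm can be sketched as follows. This simplified version matches the tags case-insensitively, which is an assumption beyond the plain string search described above.

```python
# Sketch of the title-extraction algorithm: return the text between
# <title> and </title>, or an empty string when the element is absent
# or unterminated. Tag matching is case-insensitive.
def extract_title(html):
    lower = html.lower()
    start = lower.find("<title>")
    if start == -1:
        return ""                       # no title element in this file
    start += len("<title>")
    end = lower.find("</title>", start)
    if end == -1:
        return ""                       # unterminated title container
    return html[start:end].strip()
```

Because the DTD allows no other markup inside the title container, no further cleanup of the extracted string is needed.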
4.4.1.2 Extracting Hyperlinks and Description
HTML supports hyperlinks through the anchor tag, in the form
<a href="URL"> anchor object </a>
The anchor object within an <a> container consists of text or another type of object, e.g.,
an image. This anchor object, when defined within a web page, defines a hypertext
relationship to another web page. Both the start and end tags of the <a> container must be
specified. It is the obligation of a browser to display an anchor object in a distinctive
manner so that its role is obvious to a reader. Based on this syntax, extracting a hyperlink
from an HTML file involves the following steps:
£tom an HTML file has the following steps:
step 1: look for the string starting with "Ca ";
Note: Since the definition of a hyperlink starts with ff<a'' and there rnay be some
other markups between "a" and "href ', we can not simply look for "Q href' to
find an hyperlink.
step 2: if "<an is found, keep on looking for the string startuig with "hrefc"";
step 3 : if "hre+"" is found, extract the characters that foliow "bref-"";
step 4: stop when encomtering " "" .
By these four steps, we can obtain the Dest(lRL, which is the URL of another web
page. Now, we should continue extracthg the anchor object that is the description of the
DestURL .
The anchor object may be a string or something else, e.g., an image. For example, if
the anchor object is an image, it can be defined as follows:
<a href="personal.html"><img src="personal.gif"></a>
Here, the anchor object is an image, defined by an <img> element whose src attribute
references the graphic to be displayed as the anchor for the link. Since it provides no text
description but only the file name of the graphic, we simply assign an empty string to the
anchor object in this case.
Since the anchor object appears between <a> and </a> following the definition of a
hyperlink, we can begin to extract the anchor object right after the DestURL has been
fetched. In order to get each DestURL and anchor object one by one, the following steps
should be added after the four steps introduced above:
step 5: continuing from step 4, analyze the string between '<a href="DestURL">' and
"</a>";
step 6: if there are strings starting with "<" and ending with ">", skip them;
Note: Sometimes there is other markup around the anchor object.
step 7: fetch the string that is not contained in any "<" and ">" pair, if available;
otherwise, assign the anchor object an empty string;
Note: Since we are only interested in the textual anchor object, any other anchor
objects that are not character strings will not be extracted.
step 8: stop when the string "</a>" is encountered.
The above 8 steps are used to extract the DestURL and the anchor object from an
HTML file. They just illustrate the general idea; some other things, such as skipping
spaces, should also be considered when the parser is implemented.
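Steps 1 through 8 can be sketched as follows. This is a simplified illustration: among other things, it does not skip stray whitespace and does not bound the href search at the end of the anchor tag.

```python
# Sketch of steps 1-8: extract (DestURL, anchor object) pairs from an
# HTML string. Only textual anchors are kept; an image or other
# non-text anchor object yields an empty string, as described above.
def extract_links(html):
    links, pos = [], 0
    while True:
        a = html.find("<a", pos)                 # step 1: find "<a"
        if a == -1:
            return links
        h = html.find('href="', a)               # step 2: look for href="
        if h == -1:
            return links
        h += len('href="')
        q = html.find('"', h)                    # steps 3-4: up to closing quote
        end = html.find("</a>", q)               # step 8: stop at </a>
        if q == -1 or end == -1:
            return links
        dest_url = html[h:q]
        body = html[html.find(">", q) + 1:end]   # step 5: text between the tags
        text, in_markup = [], False
        for ch in body:                          # steps 6-7: skip nested markup
            if ch == "<":
                in_markup = True
            elif ch == ">":
                in_markup = False
            elif not in_markup:
                text.append(ch)
        links.append((dest_url, "".join(text).strip()))
        pos = end + len("</a>")
```

Each returned pair corresponds directly to a (url_b, description) fragment of a tuple in the links relation.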
Finally, it is worth mentioning that DestURL sometimes appears as a relative URL if
the destination document is on the same Web server as the source document. If we just
simply fetch the relative URL, we cannot obtain useful information. In our prototype, we
convert relative URLs to absolute URLs, since the equivalent absolute URL can always be
constructed from the URL of the current document and the relative URL. The algorithm
used to convert a relative URL to an absolute URL is as follows:
Suppose the syntax of an absolute URL is:
<Protocol>://<Host>/<Path>/<File>
and we know the current URL, which is the URL of the web page in which the
relative URL is defined.
if the relative URL begins with "/"
fetch the string "<Protocol>://<Host>" from the current URL;
append the relative URL to the above string;
else if the current URL ends with "/" or the relative URL begins with "#"
append the relative URL to the current URL;
else if the current URL ends with ".html" or ".htm"
fetch the string "<Protocol>://<Host>/<Path>/" from the current URL;
append the relative URL to this string;
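The three cases above translate directly into code. This is a simplified sketch; the final fallback branch is an addition of ours for inputs that none of the three cases covers.

```python
# Sketch of the relative-to-absolute URL conversion, assuming the
# absolute URL syntax <Protocol>://<Host>/<Path>/<File>.
def resolve(current_url, relative_url):
    proto, rest = current_url.split("://", 1)
    host = rest.split("/", 1)[0]
    if relative_url.startswith("/"):
        # case 1: fetch <Protocol>://<Host> and append the relative URL
        return proto + "://" + host + relative_url
    if current_url.endswith("/") or relative_url.startswith("#"):
        # case 2: append the relative URL to the current URL
        return current_url + relative_url
    if current_url.endswith(".html") or current_url.endswith(".htm"):
        # case 3: fetch <Protocol>://<Host>/<Path>/ and append
        return current_url.rsplit("/", 1)[0] + "/" + relative_url
    return relative_url   # fallback: not covered by the three cases
```

For example, a link "pic.gif" inside http://www.csd.uwo.ca/a/index.html resolves to http://www.csd.uwo.ca/a/pic.gif by the third case.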
4.4.1.3 Extracting Other Information
Other information that should be extracted from the HTML file includes the number
of links, the author, the last modified date and the size of the file. The number of links can be
obtained by using an incremental counter as we parse the HTML file. The name of the
author or the last modified date can be found from the meta-information provided by the meta
element defined in HTML, or elsewhere in the document where such information is specified.
The method used to get this information is similar to the methods introduced above. For
the size of the file, we can simply use a UNIX utility to get it.
4.4.2 Storing the Information in the Database
When the information that describes the web pages has been extracted from HTML files
successfully, it should then be stored in the database relations. As an example, the tuples in
the following three relations are derived from the HTML file shown in Figure 2.3 by the
parser developed for our prototype system.
webpage:
(one tuple; its title column is "xiaoyu yang's home page")
Table 4.1: Relation webpage with 1 Tuple
webpage_d:
(one tuple with columns url and content)
Table 4.2: Relation webpage_d with 1 Tuple
Note: In relation webpage, since the HTML file does not provide information about the author,
the author column is empty.
In relation webpage_d, the column content refers to the whole HTML file.
links:
(13 tuples pairing source and destination URLs on www.csd.uwo.ca and related sites with
link descriptions such as "personal information", "research work", "courses",
"department of computer science", "middlesex college" and "the university of western ontario")
Table 4.3: Relation links with 13 Tuples
4.5 Query Facilities
Our prototype system provides three kinds of query: content query, structure query
and advanced query.
4.5.1 Content Query
A content query is a query that refers to the content of the documents only. In our
data model, content queries are posed on the nodes. Hence, they involve two relations in our
relational model of the Web, i.e., relations webpage and webpage_d. Similar to search
engines, this facility provides users a very simple query interface and allows users to enter
keywords for the query. Figure 4.4 shows the query interface for the content query.
Figure 4.4: The Query Interface for Content Query
When the user enters keywords, the keywords will be used to compose an SQL
statement by the query processor. In this SQL statement, the keywords define the
condition of the search; therefore, they appear only in the where clause. For example, if
the user enters the keyword "Database", the corresponding SQL statement generated by
the query processor is as follows:
select w.url, w.title, w.author, w.link_no, w.last_modified, w.size
from webpage w, webpage_d d
where w.url=d.url and d.content like '%Database%'
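The composition step can be sketched as follows; the function name and the quote-doubling escape are our assumptions about the query processor:

```python
def build_content_query(keyword):
    """Embed the user's keyword in the where clause of the fixed
    content-query SQL statement."""
    safe = keyword.replace("'", "''")  # escape single quotes in the literal
    return (
        "select w.url, w.title, w.author, w.link_no, w.last_modified, w.size "
        "from webpage w, webpage_d d "
        "where w.url=d.url and d.content like '%" + safe + "%'"
    )
```

A production system would pass the keyword through a parameter marker rather than pasting it into the SQL text, but dynamic SQL text is the style the prototype's processing suggests.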
When this SQL statement has been executed, a set of results can be obtained. Each of the
results refers to a web page that contains the keyword specified by the user. The query
processor then restructures the results and displays them to the user. The format of the
result presented to the user is illustrated in Figure 4.5.
1. URL:
Title:
Total Links: - Last modified: - Page Size: - By: -
2. URL:
Title:
Total Links: - Last modified: - Page Size: - By: -
Total 2 match(es)
Figure 4.5: Results of a Content Query
4.5.2 Structure Query
A structure query is a query posed on the underlying hypertext structure of the web
pages. In our data model, it queries the structure of the graph. Therefore, only relation
links is involved in structure queries. The design of the structure query is highly inspired by
the style of QBE (Query by Example) [15], in which users only need to fill in appropriate
places in an empty table to pose the query. The query interface is shown in Figure 4.6.
When a user poses a query, the query processor will dynamically prepare the query and a
corresponding SQL statement will be constructed according to what is specified by the
user in the query. Here, we take advantage of dynamic SQL in DB2 so that the user can
specify his or her queries in a flexible way; that is, a query can be composed of any
combination of the fields appearing in the interface. For example, assume the query is
posed as follows:
Source URL: http://www.csd.uwo.ca/
Destination URL:
Description: graduate program
The corresponding SQL statement generated by the structure query processor is:
select url_a, url_b, description
from links
where url_a='http://www.csd.uwo.ca/' and description like '%graduate program%'
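The dynamic composition can be sketched in Python; the function and the exact shape of the generated SQL are our assumptions about the structure query processor:

```python
def build_structure_query(source_url="", destination_url="", description=""):
    """Build the links-relation query from whichever QBE fields the
    user filled in; empty fields contribute no condition."""
    esc = lambda s: s.replace("'", "''")
    conditions = []
    if source_url:
        conditions.append("url_a='" + esc(source_url) + "'")
    if destination_url:
        conditions.append("url_b='" + esc(destination_url) + "'")
    if description:
        conditions.append("description like '%" + esc(description) + "%'")
    sql = "select url_a, url_b, description from links"
    if conditions:
        sql += " where " + " and ".join(conditions)
    return sql
```

Any combination of the three fields, including none at all, yields a valid statement, which is exactly the flexibility dynamic SQL provides.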
When the SQL statement is executed, a set of results will be collected, formatted and
presented to the user. The format of the results is shown in Figure 4.7.
- Structure Query -
Please define your query:
Source URL: -
Destination URL:
Description:
Figure 4.6: The Query Interface for Structure Query
1. Source URL:
Destination URL:
Description:
2. Source URL:
Destination URL:
Description:
Total 2 match(es)
Figure 4.7: Results of a Structure Query
4.5.3 Advanced Query
The advanced query simulates the CLP (Command Line Processor) provided in
DB2. It can interactively accept SQL statements from a user, execute them, and
display results of various data types. The query can be posed on any of the three relations
that are proposed in our prototype. This kind of query offers a flexible means to pose
queries. However, it also makes the implementation of the query processor more difficult,
since we do not have a clue what the users will include in their queries. For example, we
do not have any idea about the number of columns in the result set and their data types.
This kind of information, however, is essential in executing a query. On the other hand,
DB2 offers a number of descriptors for passing data types and/or values between the
application program and the database, e.g., the SQLDA, which can describe the data types,
lengths, and values of a variable number of data items. These descriptors are more flexible
than a list of host variables used in normal dynamic SQL, because they can be dynamically
configured for different numbers and types of data items at run time. The implementation
of our query processor is based on these descriptors provided by the system. After
a query is executed, the query processor analyzes the contents of these descriptors.
From them, information about the number of columns in a result, their data
types and their values can be obtained. All these kinds of information are what compose the
query results. In the advanced query, a user interface is developed to collect and process
the ad hoc queries posed by the user. This interface is shown in Figure 4.8.
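The run-time discovery that the SQLDA enables in DB2 can be illustrated with Python's DB-API, where cursor.description plays the analogous role (SQLite stands in for DB2 here, and the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table webpage (url text, title text, link_no integer)")
conn.execute("insert into webpage values ('http://www.uwo.ca/', 'uwo', 22)")

# An arbitrary user statement: the column count, names and values are
# unknown until after execution, just as with an SQLDA.
cur = conn.execute("select url, link_no from webpage")
columns = [d[0] for d in cur.description]  # discovered at run time
rows = cur.fetchall()
```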
The results of a query correspond to what is specified in the select clause of the user's
SQL statement. When displaying the results, the UNIX editor pico will be invoked and the
results will be displayed by pico. The reason we exploit an existing editor is that it can
provide convenient editing tools for the query results. In this way, the user can choose to
save the results to a file or select only part of the results to be saved. It is also useful when
the user wants to compare several query results, because the previous results are not lost.
- Advanced Query -
** Queries can be posed on the following three relations: **
webpage (url, title, author, link_no, last_modified, size)
webpage_d (url, content)
links (url_a, url_b, description)
Note: column content can not appear in a select clause
Please enter an SQL statement or 'quit' to Quit:
Figure 4.8: The Query Interface for Advanced Query
1. Column Name 1:
Column Name 2:
Column Name 3:
2. Column Name 1:
Column Name 2:
Column Name 3:
Figure 4.9: Results of an Advanced Query
Figure 4.9 shows the general format of a result set. In this figure, the Column Names
correspond to the names that have been specified in the select clause of the SQL
statement. Please note that only the attribute names in the three relations can appear in the
select clause. The only exception is that the attribute "content" can not appear in the select
clause, because the data type of this attribute is CLOB, which is not suitable to be
displayed in a limited space.
4.6 User Interfaces
Besides the query interfaces introduced in section 4.5, our prototype system also
provides a simple user interface, by which the user can select different functions. Basically,
there are two main user interfaces: one provides access to the three different query
facilities, which is shown in Figure 4.10; the other guides the user to the supplementary
functions used to maintain the web pages stored in the database. This interface is shown in
Figure 4.11.
Querying the Web Main Menu
1. Content Queries Only
2. Structure Queries Only
3. Advanced Queries
4. Others
x. Exit
Figure 4.10: The User Interface for Accessing Query Facilities
4.7 Supplementary Functions
There are several supplementary functions in our prototype system providing a means
to access the relations used in our prototype. Currently, our system provides the following
functions that can be used to:
find hyperlinks from a locally stored HTML file;
create a relation in the database;
add or delete a web page from the database; and
display the tuples in relations webpage or links.
Others
1. Find Hyperlinks from an HTML File
2. Create Relation
3. Add One Web Page to Database
4. Display Relation webpage
5. Display Relation links
6. Delete One Web Page from Database
Figure 4.11: The User Interface for Supplementary Functions
4.8 Summary
This chapter focuses on the implementation of our prototype web querying system.
For modeling the Web, we use a labeled directed graph, which gives us a straightforward
view of the hypertext structure. Based on this data model, it is possible to map the web
pages to relations in a relational database. Once in the database, web pages can be queried
like other data in the database. However, web pages have their own special features, such as
their hypertext structure and irregularity. Therefore, queries on web pages have special
meaning compared to traditional queries. One distinct feature in querying the Web is that
the hypertext structure should also be considered, which leads to the development of the
structure query. To accomplish this, three basic query facilities are provided in our
prototype system, namely, content query, structure query and advanced query. The content
query is used to accept user keywords and then find the web pages that contain the
keywords. The structure query lets us go beyond the keyword search so that the hypertext
structure can be queried as well as keywords. The advanced query is designed for ad hoc
queries. It provides a powerful means to query the web pages stored in the database
flexibly. Besides these functions, there are also some other functions that are used to
maintain the web pages stored in the database. These functions are especially useful when
we develop and test the prototype. During the development of the prototype, we benefited
a lot from exploiting an existing database, i.e., DB2 in our case. And its power will also be
shown in the next chapter when we pose queries on the web pages. As can be shown, if
the Web is modeled by a proper data model and the model can be represented by relations
in a relational database, advanced database technology will provide high-level query
facilities to query the web pages in a flexible and powerful way.
Chapter 5
Querying the Web Pages
As described in the previous chapters, our web querying prototype system provides
three alternatives to pose a query: querying the keywords, querying the hypertext
structure and querying arbitrary content. This chapter starts by describing the distinct
features of these three kinds of queries. Then, the results obtained by running various
types of queries are examined.
5.1 Query Methods
The essential query in web querying is the content query, which is also known as the
keyword query. This kind of query is provided by all search engines. The idea of a content
query is to find the web pages that contain the keywords specified by the user. Obviously,
querying the Web by this means only is not enough, because in a content query, there is no
way to explore the hypertext structure, which, however, is an important feature of web
pages. Therefore, we developed a query facility called the structure query to deal with the
hypertext structure. In addition, our prototype also provides an advanced query facility,
which allows users to pose queries in a flexible way. This advanced query facility is able to
query all the web pages stored in the database. In the following subsections, we illustrate
how these three query facilities work.
5.1.1 Content Query
In our prototype system, the content query uses keywords to query the content of a
web page. The possible keywords can be a single word, a phrase, a sentence or anything
that can appear in an HTML file. Our prototype system even allows the keyword to be
null. In this case, the system assumes the user asks for displaying all the web page
information stored in the relation webpage, e.g., the URL of the web page, the title, the
number of links, etc. Also, wild cards can be embedded in the keywords. The wild card
recognized by DB2 is '%'. It can appear anywhere in the keywords, representing an
arbitrary character string. In our system, the keywords used to search for the information of
interest are partly case sensitive. Users have the alternative to enter their keywords in either
upper or lowercase. When in doubt, the user can use lowercase text in queries. In this
case, the query facility finds both upper and lowercase results. When uppercase text is
used, the query facility finds results only in uppercase. For example, when the user
searches for "london", all occurrences of "london", "London" and "LONDON" can be
found in the results. However, when the user searches for "London", only "London" will
be seen in the results. As can be noticed, entering the keywords exactly as shown in a file
can narrow the scope of the search so that more precise results can be obtained.
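The case rule can be sketched as a small matching helper (our illustration of the observed behaviour; the prototype realizes it inside the generated SQL):

```python
def keyword_matches(keyword, text):
    """An all-lowercase keyword matches regardless of case; a keyword
    containing any uppercase letter must match exactly."""
    if keyword == keyword.lower():
        return keyword in text.lower()
    return keyword in text
```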
5.1.2 Structure Query
The structure query is used to query the underlying hypertext structure embedded in
web pages. Thus, this kind of query explores the web pages by their hyperlinks. In our
prototype system, we provide a QBE-like interface for the structure query. With this
query interface, the user can pose a query by simply filling in the blanks. The interface
provides the following three fields to be filled in: Source URL, Destination URL and
Description. Source URL refers to the URL of a web page in which hyperlinks are
contained. Destination URL is the URL of a hyperlink contained in the web page.
Description is the anchor text describing the hyperlink. There are several alternatives that
users can use to pose a query. If users would like to find out all the hyperlinks in a certain
web page, they just need to provide the source URL. If users provide only the
destination URL, all the web pages that have a link pointing to the
destination URL will be found. Similarly, if users enter both the source URL and the
description, then all the hyperlinks with the specified description in the web page defined
by the source URL will be found, and so on.
5.1.3 Advanced Query
This kind of query is designed for advanced users. There is no restriction on posing
the query. Users can simply write any SQL statements related to the three relations used in
our system, i.e., webpage (url, title, author, link_no, last_modified, size); webpage_d (url,
content); links (url_a, url_b, description). Here, attributes link_no and last_modified are
integers; the data type of attribute content is CLOB. All the other attributes are character
strings. What is worth noticing is that only attributes mentioned in the three relations can
appear in the SQL statement.
The user interface of the advanced query mimics the CLP (Command Line
Processor) of DB2. Therefore, the user can just write the SQL statements and then
execute them. The results of an advanced query totally depend on what is specified in the
select clause of the SQL statement.
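For experimentation, the three relations can be recreated; a sketch in SQLite standing in for DB2, with TEXT approximating CLOB (the column types are our reading of the schema described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table webpage (
        url text primary key, title text, author text,
        link_no integer, last_modified integer, size integer);
    create table webpage_d (url text primary key, content text);
    create table links (url_a text, url_b text, description text);
""")
tables = sorted(r[0] for r in conn.execute(
    "select name from sqlite_master where type='table'"))
```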
5.2 Experimental Results
To test our prototype system and verify the power of using a database query language
to query the Web, we have run several types of queries, and we analyze some of them in
the following examples. In our experiment, more than 70 web pages were chosen to run the
test. All these web pages were reachable from the home page of the Department of
Computer Science, the University of Western Ontario on August 2, 1998. For test
purposes, the HTML files of the corresponding web pages were downloaded to the local
disk and mapped to the database in advance.
5.2.1 Content Query
Content queries are posed on the content of an HTML file. Wild cards and HTML tags
may be used in a content query. To simplify the presentation, in the following examples,
"Input" illustrates the input entered by the user and "Results" shows the query results
presented to the user.
Query 1: Find the web pages that contain the keyword "database".
Input: database
Results:
1. URL: http://www.csd.uwo.ca/research/
Title: uwocsd - department research
Total Links: 4 Page Size 1315
2. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
3. URL: http://www.csd.uwo.ca/faculty/sylvia/pubs.html
Title: sylvia osborn's recent publications
Total Links: 1 Last modified: 01/21/1998 Page Size 2058
4. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
Title: course description
Total Links: 9 Page Size 1519
7. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/research.html
Title: research work
Total Links: 35 Last modified: 02/18/1998 Page Size 4020
8. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
Title: course information
Total Links: 23 Last modified: 02/18/1998 Page Size 6408
9. URL: http://www.csd.uwo.ca/faculty/sylvia.html
Title: sylvia l. osborn
Total Links: 2 Page Size 1184
10. URL: http://www.registrar.uwo.ca/accals/1997/crs_409.htm
Title: course description
Total Links: 8 Page Size 1622
11. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
Title:
Total Links: 1 Page Size 13393
Total 11 match(es)
In the above query, the query results are the web pages containing the keyword
specified by the user, i.e., "database".
Query 2: Find the web pages that contain the keyword "Database" in their text.
Input: Database
Results:
1. URL: http://www.csd.uwo.ca/research/
Title: uwocsd - department research
Total Links: 4 Page Size 1315
2. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
3. URL: http://www.csd.uwo.ca/faculty/sylvia/pubs.html
Title: sylvia osborn's recent publications
Total Links: 1 Last modified: 01/21/1998 Page Size 2058
4. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
Title: course description
Total Links: 9 Page Size 1519
7. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/research.html
Title: research work
Total Links: 35 Last modified: 02/18/1998 Page Size 4020
8. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
Title: course information
Total Links: 23 Last modified: 02/18/1998 Page Size 6408
In the above two queries, the only difference is that the keyword in Query 1, i.e.,
"database", is in lowercase, while the first letter of the keyword in Query 2, i.e.,
"Database", is in uppercase. The results of these two queries are different because the
content query is case sensitive. The results of Query 2 are included in the results of Query 1.
Query 3: Find the web pages that contain the string "computer science 411" in their text.
Input: computer science 411
Results:
1. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
2. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
3. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
This is a query that finds web pages containing a string specified by the user.
Query 4: Find the web pages that contain the following keywords in the order
of "thesis", "submission" and "deadline" in their content.
Input: thesis % submission % deadline
Results:
1. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
Title:
Total Links: 1 Page Size 13393
In this example, a wild card is included in the keyword; therefore the resulting web page
must contain a character string that begins with "thesis", followed by a (possibly empty)
character string and "submission", followed by another (possibly empty) character string,
and finally ends with "deadline". As can be seen in the result, since this web page does not
provide any title information, the Title field in the result is empty.
Query 5: Find the web pages that contain meta-information.
Input: <meta%>
Results:
1. URL: http://www.registrar.uwo.ca/accals/
Title: academic calendars
Total Links: 9 Page Size 3120
2. URL: http://www.uwo.ca/
Title: the university of western ontario
Total Links: 22 Last modified: 1998/02/02 Page Size 4438 By: colleen
Some HTML files have meta-information, which is used to provide general
information about web pages. But this kind of information is not displayed in a web page.
Meta-information is defined by an HTML tag <meta (attribute list)>. Therefore, if we are
interested in the web pages that provide meta-information, we can find these pages by
simply looking for the corresponding HTML files that contain the <meta> tag.
Query 6: Display the information of all the web pages stored in the database.
Input:
Results:
Would you like to display all the records? (y/n) n
In this query, users do not need to enter anything. The system will display by default
all the information of the web pages currently stored in the database. In case there are
too many web pages stored in the database, it is not practical to display all the
information. Hence, if the keyword entered in the content query is null, the system will
prompt the users to confirm whether or not they wish to display all the information.
Query 7: Find the web pages that contain the string "travel agent".
Input: travel agent
Results:
No match has been found.
Since none of our test web pages contains the keywords "travel agent", there is no
result found.
5.2.2 Structure Queries
Structure queries are posed on the hypertext structure of the web pages. In this kind
of query, we exploit a QBE-like query interface for users to pose queries.
Query 8: Find all the hyperlinks contained in the web page whose URL is
"http://www.uwo.ca/".
Input: Source URL: http://www.uwo.ca/
Destination URL:
Description:
Results:
1. Source URL: http://www.uwo.ca/
Destination URL: http://search.uwo.ca:8765/
Description: search uwo
2. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/aboutuwo/~rms.html
Description:
3. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/aboutuwo/mcintosh.html
Description:
4. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/academic/
Description: academic programs faculties and colleges
5. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/adminservices/
Description: administration and services
6. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/alumni/
Description: alumni and friends
7. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/index.html
Description:
8. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/ip/
Description: web publishing at uwo
9. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/libinfo/
Description: libraries and information resources
10. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/research/
Description: research and scholarship
Display more (y/n)? y
11. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/selected_browse.html
Description: internet
12. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description: admissions and student resources
13. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/uwocom/
Description: campus information news and events
14. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/uwoincommunity/
Description: western in the community
15. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/whatsnew/
Description: new
16. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/whois/
Description: find people
17. Source URL: http://www.uwo.ca/
Destination URL:
Description:
Total 17 match(es)
In this query, users only need to specify the Source URL and leave the other two
fields blank. The results are the URLs of the web pages reachable from the web page
identified by the Source URL.
Query 9: Find the web pages that contain a link to the web page whose URL is
"http://www.uwo.ca/".
Input: Source URL:
Destination URL: http://www.uwo.ca/
Description:
Results:
1. Source URL: http://www.csd.uwo.ca/faculty/sylvia/
Destination URL: http://www.uwo.ca/
Description: the university of western ontario
2. Source URL: http://www.uwo.ca/grad/
Destination URL: http://www.uwo.ca/
Description: uwo home page
3. Source URL: http://www.registrar.uwo.ca/accals/
Destination URL: http://www.uwo.ca/
Description:
As shown in the query, when only the field Destination URL is filled, the results will
be the web pages that have links to the web page specified by the Destination URL.
Query 10: Find the web pages that contain hyperlinks about course 407.
Input: Source URL:
Destination URL:
Description: 407
Results:
1. Source URL: http://www.csd.uwo.ca/courses/
Destination URL: http://www.csd.uwo.ca/courses/cs407/
Description: cs407 - advanced software engineering
2. Source URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Destination URL: http://www.registrar.uwo.ca/accals/1997/crs_434.htm
Description: computer science 407a/b
Total 2 match(es)
In this query, only the Description is specified. The Destination URL in each result is the
hyperlink whose anchor text matches the string entered in the Description field. The
Source URL identifies the web page that contains the hyperlink.
Query 11: Retrieve the anchor text of a given hyperlink in a certain web page.
Input: Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description:
Results:
1. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description: admissions and student resources
This kind of query is used in case users know the URL of a web page and one of
its hyperlinks, and they are interested in what the hyperlink is about. It will extract the
detailed description of the specified hyperlink for users.
Query 12: Find the hyperlinks about "submission" and "thesis" in a certain web page.
Input: Source URL: http://www.uwo.ca/grad/thesis/
Destination URL:
Description: submission % thesis
Results:
1. Source URL: http://www.uwo.ca/grad/thesis/
Destination URL: http://www.uwo.ca/grad/thesis/chap1a.html#final
Description: 1.5 final submission of the examined and corrected thesis
2. Source URL: http://www.uwo.ca/grad/thesis/
Destination URL: http://www.uwo.ca/grad/thesis/chap1a.html#submission
Description: 1.2 submission of the thesis for examination
Total 2 match(es)
As can be seen in these results, two hyperlinks contained in the web page specified by
the user are found. The keywords "submission" and "thesis" appear in both of the
Description fields.
Query 13: Find the web pages that contain a link to the home page of the Department of
Computer Science, the University of Western Ontario, and the anchor is about
"department of computer science".
Input: Source URL:
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
Results:
1. Source URL: http://www.csd.uwo.ca/faculty/sylvia/
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
2. Source URL: http://www.csd.uwo.ca/gradstudents/students/xyang/
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
3. Source URL: http://www.csd.uwo.ca/gradstudents/students/xyang/personal.html
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
Total 3 match(es)
When users know a hyperlink and the corresponding anchor text, they can use this
query to find out which web pages contain the hyperlink.
Query 14: Find whether there is a web page whose URL is specified as the Source URL
and that contains a link to another given web page.
Input: Source URL: http://www.uwo.ca/
Destination URL: http://www.csd.uwo.ca
Description: computer science
Results:
No match has been found.
Since the home page of the University of Western Ontario does not have a link to the
Department of Computer Science's home page, the result of this query is empty.
Query 15: Enter nothing for the query.
Input: Source URL:
Destination URL:
Description:
Result:
Would you like to display all the records? (y/n) n
If users enter nothing for the query, the system assumes users would like to examine
all the tuples about the hyperlinks. Since there may be too many tuples to be displayed, the
system prompts users to confirm whether or not they want to display the results.
5.2.3 Advanced Query
Advanced queries can be posed on the content, the hypertext structure, or the
combination of the content and the hypertext structure of web pages. This kind of query is
written as an SQL statement, and hence provides a flexible means for users to pose the
query. Meanwhile, users can take full advantage of advanced database query facilities when
writing the query as an SQL statement.
Query 16: Find the web pages about "database".
Input: select w.url, w.title, w.link_no, w.size \
from webpage w, webpage_d d \
where w.url=d.url and d.content like '%database%'
Results:
1. URL: http://www.csd.uwo.ca/grad/courses.html
TITLE: uwocsd - graduate courses
LINK_NO: 12
SIZE: 9918
2. URL: http://www.csd.uwo.ca/faculty/sylvia.html
TITLE: sylvia l. osborn
LINK_NO: 2
SIZE: 1184
3. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
TITLE:
LINK_NO: 1
SIZE: 13393
4. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
TITLE: course description
LINK_NO: 9
SIZE: 1413
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
TITLE: course description
LINK_NO: 9
SIZE: 1519
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_409.htm
TITLE: course description
LINK_NO: 8
SIZE: 1622
As shown in the above example, this query can be used as a content query. It finds the
web pages which have "database" in their text. The difference between this query and the
content query is that the former looks for the web pages that contain the exact keyword
specified in the where clause, while the latter searches for the web pages that contain the
keyword in both lower and uppercase.
Query 17: Find web pages whose titles contain "home page".
Input: select url, title \
from webpage \
where title like '%home page%'
Results:
1. URL: http://www.csd.uwo.ca/faculty/sylvia/
TITLE: sylvia osborn's home page
2. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/
TITLE: xiaoyu yang's home page
This query can be used to find out a group of web pages with a particular topic.
Query 18: Find web pages which have no outgoing links.
Input: select url, title \
from webpage \
where link_no=0
Results:
This query can find all the leaf nodes in the graph model used in our system.
Query 19: Retrieve the title and the URL of all the web pages that are pointed to from the
web page whose URL is http://www.csd.uwo.ca/gradstudents/students/xyang/.
Input: select w.url, w.title \
from webpage w, links l \
where l.url_a='http://www.csd.uwo.ca/gradstudents/students/xyang/' \
and w.url=l.url_b
Results:
1. URL: http://www.csd.uwo.ca/faculty/sylvia/
TITLE: sylvia osborn's home page
2. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
TITLE: course information
3. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/interests.html
TITLE: interesting links
4. URL:
TITLE:
5. URL:
TITLE:
6. URL:
TITLE:
This query is used to find all the hyperlinks in a web page specified by the user.
Query 20: Find the URLs of web pages mentioning "WebSQL" from the web page:
http://www.csd.uwo.ca/gradstudents/students/xyang/.
Input: select w.url \
from webpage w, webpage_d d, links l \
where d.content like '%WebSQL%' \
and l.url_a='http://www.csd.uwo.ca/gradstudents/students/xyang/' \
and d.url=l.url_b and w.url=l.url_b
Results:
From a given web page, this query tries to find all the web pages that contain specific
information of interest and are pointed to from the given web page.
Query 2 1: Find web pages contain anchors mentioning "publications" in their t ea .
Input: select w-url, w-title \
nom webpage w, Links 1 \
where 1-description Iike '%publications%' and 1-url-a=w.url
1. URL: hûp ~/www.csd.uwo.ca/fàcuhy/syLvia/
TmE: sylvia osborn's home page
This query can find all the web pages that contain hyperlinks mentioning a particular
topic.
Query 22: Find web pages with a certain topic that are reachable from a given web page
within three levels.
Input: with \
temp(url_b, description, stops) as \
((select url_b, description, cast(0 as smallint) \
from links \
where description='computer science 411a/b') \
union all \
(select l.url_b, l.description, cast(t.stops+1 as smallint) \
from links l, temp t \
where t.url_b=l.url_a \
and l.description like '%computer science%' \
and t.stops<2)) \
select p.url_b, p.description \
from temp p
Results:
1. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-435.htm
DESCRIPTION: computer science 411a/b
2. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-420.htm
DESCRIPTION: computer science 319a/b
3. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-414.htm
DESCRIPTION: computer science 202
This example illustrates a recursive query. The recursive query is an important feature in
the new generation of relational databases. It is very powerful, because it allows certain
kinds of questions to be expressed in a single SQL statement that would otherwise require
the use of a host program. As can be seen in the example, the use of a recursive query lets
us go along the hypertext hierarchy to find the information of interest, which is impossible
by using any of the search engines currently available.
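The traversal performed by Query 22 can be sketched in a self-contained way using SQLite's WITH RECURSIVE, which plays the same role as the recursive common table expression in the DB2 version above. This is only an illustrative sketch: the links rows below are made-up sample data, not the pages used in our experiment.

```python
import sqlite3

# Toy copy of the links(url_a, url_b, description) relation from our model.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (url_a TEXT, url_b TEXT, description TEXT)")
con.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    ("page0", "page1", "computer science 411a/b"),
    ("page1", "page2", "computer science 319a/b"),
    ("page2", "page3", "computer science 202"),
    ("page3", "page4", "computer science 210"),  # more than three levels away
])

# Same shape as Query 22: anchor rows at stops=0, then follow edges while
# the anchor text still mentions the topic and we are within three levels.
rows = con.execute("""
    WITH RECURSIVE temp(url_b, description, stops) AS (
        SELECT url_b, description, 0
        FROM links
        WHERE description = 'computer science 411a/b'
        UNION ALL
        SELECT l.url_b, l.description, t.stops + 1
        FROM links l, temp t
        WHERE t.url_b = l.url_a
          AND l.description LIKE '%computer science%'
          AND t.stops < 2
    )
    SELECT url_b, description FROM temp
""").fetchall()
for url, desc in rows:
    print(url, desc)
```

Note how `t.stops < 2` bounds the recursion to three levels (stops 0, 1 and 2), so "page4" is never reached.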
5.3 Summary
This chapter focuses on the query methods implemented in our prototype system and
the experimental results obtained by using these query methods. Currently, there are three
kinds of query facilities available in our prototype system, namely, the content query, the
structure query and the advanced query. The content query searches the keywords
specified by the user. In this kind of query, wild cards or HTML tags are allowed to be
included in the keywords. From the experimental results, it is easy to find that any word,
phrase, tag or sentence can be used as a keyword as long as it appears in an HTML file. As
can be seen, it is an essential yet powerful method to find web pages of interest. However,
there exists a significant limitation in the content query, that is, it cannot be posed on the
hypertext structure presented in an HTML file. To remedy this defect, we developed a
new query facility which is called the structure query. It allows users to pose queries on the
URL of a web page, the hyperlinks of the web page and their corresponding anchor text.
This kind of query makes it possible to explore the structure of web pages and thus
enhances the query capability of our system. In addition, our prototype system offers an
advanced query facility by which queries can be posed in the form of SQL statements. It
allows complex queries to be posed to search both the content and the structure of
web pages. This query facility takes full advantage of the database query language and
therefore provides a very powerful means of querying the Web.

More than 70 web pages have been used to test our prototype system. The
experimental results indicate that users can thoroughly search the web pages and obtain
the information of interest by using the query facilities provided by our prototype
system. The experimental results also strongly show that the database query language is a
very powerful tool for querying the Web.
Chapter 6
Discussions and Future Work
6.1 Discussions
In this thesis, we introduced our prototype system for querying the Web with
database query languages. Currently, there are two ways to retrieve information from the
Web, namely, navigation/browsing and searching by search engines. However, there exist
some significant limitations, such as the "lost-in-hyperspace" phenomenon, no query
facilities for the hypertext structure, etc. These drawbacks motivated the development of our
web query system. The design of our prototype system is based on the idea that the
combination of database techniques and the Web modeling technique should provide a
much more powerful and flexible way of querying the Web.
To query the Web, the first problem encountered is web modeling. In our
prototype system, we model the Web as a labeled directed graph, with web pages as nodes
and labels on the edges. Based on this graph data model, the Web can be viewed as a
relational database, which is composed of three relations: webpage, webpage_d and links.
This relational abstraction of the Web allows us to store web page information in a
database and use a database query language to retrieve the information of interest.
To extract the information of a web page from the source HTML file and store it in
the database, a parser was designed and implemented in our prototype system. A user
interface was also developed to accept the query and guide users to all the available
functions of our system.
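The kind of extraction the parser performs can be sketched as below. The prototype's parser is written in C; this Python class, its name, and the sample page are all invented for illustration.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Sketch: pull the title and the (href, anchor text) pairs out of a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []            # (href, anchor_text) pairs for the links table
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)

p = PageParser()
p.feed("<html><head><title>xiaoyu yang's home page</title></head>"
       '<body><a href="courses.htm">course information</a></body></html>')
print(p.title, p.links)
```

The extracted title would populate the webpage relation, and each (href, anchor text) pair would become a row of links.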
As for the query facilities, our prototype system was implemented to support three
kinds of query, namely, the content query, the structure query and the advanced query.
The content query provides the keyword querying function that most search engines
provide. Its limitation is that it cannot query the hypertext structure of the web page. To
address this problem, our prototype system provides a particular query facility called the
structure query. It allows users to search the web pages by their underlying hypertext
structure. Furthermore, our system also offers an advanced query facility, by which users
can pose arbitrary queries in the form of complex SQL statements. This query method
benefits a lot from the advanced query capabilities of the database query language and
allows queries to be posed on both the content and the hypertext structure of web pages.
The prototype system has been extensively tested with more than 70 web pages. The
testing results show that the database query language has been successfully used to query
web pages and the whole system performs very well.
As a newly developed prototype, our web querying system has some limitations. It is
suitable for searching web information in a limited scope, but not scalable to the whole
Web at this moment. Also, at the current stage, our system cannot automatically react to
changes of web pages. Moreover, the content query actually duplicates what search
engines already do. Thus, future work should address overcoming these limitations
and turning this prototype into a complete web querying system.
6.2 Future Work
There is a lot of work that could be done in the area of querying the Web. The work
is related to several areas in computer science, including databases, networks, user
interfaces, and algorithm design. Currently, our prototype system emphasizes only
the issues of databases and provides a framework for future development of the web
querying system.
The first and foremost improvement that could be made to our prototype system is to
give the system the capability of interacting with the Internet dynamically. To date,
our prototype system can only deal with web pages that are stored on the local disk.
These web pages are downloaded from the Web in advance. In order to query the Web
dynamically, the prototype system should be extended in several aspects. Firstly, the
current programs written in C should be translated into Java to take advantage of its
strong capability of communicating with the Internet. Secondly, the content query facility
should be enhanced. The keyword search can be divided into two steps. In the first step,
several search engines are used to query web pages and the results from different search
engines are combined to provide more extensive information. Similar work can be found
in [16]. In the second step, the results from the first step will be further searched by our
content query facility to eliminate the irrelevant results obtained from the search engines.
Intuitively, these two steps could make the search much better and increase the precision
of the results. Thirdly, the problem of automatically updating the web pages stored in the
database should be solved. Since one of the features of the Web is that it changes
frequently, making the web pages in the database adapt to the changes is very important
for obtaining the latest web information.
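The proposed two-step keyword search could be sketched as follows. Everything here is hypothetical: the engine result lists, the fetched page texts, and both helper functions are invented to illustrate the merge-then-filter idea, not part of the prototype.

```python
def merge_results(*result_lists):
    """Step 1 (sketch): union the URL lists from several search engines,
    preserving first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

def content_filter(urls, pages, keyword):
    """Step 2 (sketch): keep only URLs whose fetched page text actually
    contains the keyword, eliminating irrelevant engine hits."""
    return [u for u in urls if keyword in pages.get(u, "")]

# Invented sample data standing in for two engines' result lists.
engine_a = ["http://a.example/1", "http://a.example/2"]
engine_b = ["http://a.example/2", "http://b.example/3"]
pages = {
    "http://a.example/1": "a page about WebSQL and web query languages",
    "http://a.example/2": "an unrelated page",
    "http://b.example/3": "WebSQL examples",
}

merged = merge_results(engine_a, engine_b)
relevant = content_filter(merged, pages, "WebSQL")
print(relevant)
```

The merge step broadens recall by combining engines; the content filter then raises precision by rechecking each candidate page against the content query.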
The second improvement that could be made is related to the recursive query. In our
experiment, we tried to pose recursive queries on a hypertext hierarchy without cycles and it
worked fine. But we had some trouble using recursive queries on a hypertext structure
with a cycle. The reasons are as follows: (1) the version of DB2 we used cannot fully
support transitive closure when there are cycles; and (2) it is difficult to design the
stopping rules for the recursive queries because sometimes the information provided by
web pages is incomplete. Since recursive queries are powerful and very important, how to
use them properly in querying arbitrary web pages should be well studied in the future.
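One possible stopping rule for cyclic hypertext, sketched outside the database: track the set of already-visited URLs so each page is expanded at most once. The function and the three-page cyclic graph below are invented for illustration; they are not how the prototype handles cycles.

```python
def reachable(links, start, max_stops):
    """Breadth-first walk over (url_a -> url_b) edges, bounded by max_stops.
    The visited set makes the walk terminate even when the graph has cycles."""
    visited = {start}
    frontier = [start]
    for _ in range(max_stops):
        nxt = []
        for a, b in links:
            if a in frontier and b not in visited:
                visited.add(b)
                nxt.append(b)
        frontier = nxt
    return visited

# Invented graph containing a cycle: p1 -> p2 -> p3 -> p1.
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p1")]
print(sorted(reachable(links, "p1", 3)))
```

A pure recursive SQL query without such a visited check would revisit p1 forever on this graph, which is why both a depth bound and duplicate elimination matter as stopping rules.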
Another extension to our prototype system is the development of a user-friendly
interface for specifying queries. This interface could be developed in the spirit of Query By
Example.
References

[1] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. A Declarative Language
for Querying and Restructuring the Web. In Proc. of the 6th International Workshop
on Research Issues in Data Engineering, RIDE'96, IEEE, New Orleans, Feb. 1996.
[2] G. Arocena, A. Mendelzon, G. Mihaila. Applications of a Web Query Language.
In Proc. of the Sixth International World Wide Web Conference, 1997.
[3] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. In Proc.
of the First International Conference on Parallel and Distributed Information Systems
(PDIS'96), December 1996.
[4] Bebo White. HTML and the Art of Authoring for the World Wide Web. Kluwer
Academic Publishers, 1996.
[5] Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and
Arthur Secret. The World-Wide Web. Communications of the ACM, Vol. 37,
No. 8, Aug. 1994.
[6] Martijn Koster. Robots in the Web: Threat or Treat?
http://web.nexor.co.uk/users/mak/doc/robots/threat-or-treat.html
[7] S. Abiteboul, S. Cluet, and T. Milo. Querying and Updating the File. In Proc. of
the Conf. on Very Large Databases (VLDB), Morgan Kaufmann, 1993.
[8] V. Christophides, S. Cluet, S. Abiteboul and M. Scholl. From Structured
Documents to Novel Query Facilities. In ACM SIGMOD International Conference
on Management of Data, 1994.
[9] S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener. The Lorel Query
Language for Semi-Structured Data. 1996. http://www-db.stanford.edu.
[10] S. Abiteboul. Querying Semi-Structured Data. In Proc. of the International
Conference on Database Theory, Delphi, Greece, Springer, 1997.
[11] P. Atzeni, G. Mecca and P. Merialdo. Semistructured and Structured Data in the
Web: Going Back and Forth. ACM SIGMOD Record, Vol. 26, No. 4, December
1997.
[12] N. Ashish and C. Knoblock. Semi-Automatic Wrapper Generation for Internet
Information Sources. http://www.isi.edu/sims/{naveen,knoblock}
[13] D. Konopnicki and O. Shmueli. W3QS: A Query System for the World Wide Web.
In Proc. of the 21st VLDB Conference, Morgan Kaufmann, 1995.
[14] C. Jenkins, M. Jackson, P. Burden and J. Wallis. Searching the World Wide Web:
An Evaluation of Available Tools and Methodologies. Information and
Software Technology, Vol. 39, pp. 985-994, Butterworth Scientific, 1998.
[15] Serge Abiteboul, Richard Hull and Victor Vianu. Foundations of Databases.
pp. 150-156. Addison-Wesley Publishing Company, 1995.
[16] U. Manber and P. A. Bigot. Connecting Diverse Web Search Facilities. Data
Engineering Bulletin, IEEE, Vol. 21, No. 2, June 1998.
[17] A. O. Mendelzon and T. Milo. Formal Models of Web Queries. In Proc. of the
16th ACM Symposium on Principles of Database Systems, Tucson, Arizona,
May 1997.
[18] P. Buneman. Semistructured Data: A Tutorial. In Proc. of PODS, pages 117-121,
Tucson, Arizona, May 1997.
[19] S. Nestorov, S. Abiteboul and R. Motwani. Inferring Structure in Semistructured
Data. ACM SIGMOD Record, Vol. 26, No. 4, December 1997.
[20] Y. Papakonstantinou, H. Garcia-Molina and J. Widom. Object Exchange Across
Heterogeneous Information Sources. In IEEE International Conference on Data
Engineering, March 1995.
[21] P. Buneman, S. Davidson, M. Fernandez and D. Suciu. Adding Structure to
Unstructured Data. In Proc. of ICDT, Springer, January 1997.
[22] D. Quass, A. Rajaraman, Y. Sagiv and J. Ullman. Querying Semistructured
Heterogeneous Information. In Proc. of the 4th International Conference on
Deductive and Object-Oriented Databases, Springer, December 1995.
[23] V. Bush. As We May Think. Atlantic Monthly, Vol. 176, No. 1, 1945.
[24] D. Chamberlin. Using the New DB2. pp. 1-35. Morgan Kaufmann Publishers, Inc.,
1996.