Querying Web Pages with Database Query Languages
by
Xiaoyu Yang
Graduate Program in Computer Science
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
Faculty of Graduate Studies
The University of Western Ontario
London, Ontario
November 1998
© Xiaoyu Yang 1998
National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
As the World Wide Web is growing at a phenomenal rate, it becomes more and more
difficult to retrieve information of interest from the enormous number of resources that are
available. Currently, there are two ways to retrieve information from the Web, namely,
navigation/browsing and searching by search engines. However, these search methods
have significant limitations, such as the "lost-in-hyperspace" phenomenon, the ignorance
of the hypertext structure, etc. These drawbacks motivated the development of a flexible
and powerful web query system.
This thesis presents a prototype system developed to query the Web with database
query languages. In our prototype system, the Web is modeled as a labeled directed graph
which can be stored in a relational database. A parser was designed and implemented in
our prototype system to extract the information of a web page from the source HTML file
and store it into the database. Three query facilities are developed in the prototype system,
namely, the content query, the structure query and the advanced query, which can be used
to pose queries on both the content and the hypertext structure of web pages. Extensive
experiments have been performed to test the prototype system. The testing results show
that database query languages can be used successfully in querying the Web.
ACKNOWLEDGMENTS
I wish to express my most sincere gratitude and appreciation to my supervisor, Dr.
Sylvia Osborn, for her time, invaluable guidance, support, understanding and
encouragement during the course of this work.
Thanks are also extended to all my friends and fellow graduate students for their
suggestions, encouragement and for the friendly environment they provided.
Most of all, I would like to thank my parents and my husband. Their love and support
are invaluable.
TABLE OF CONTENTS
CERTIFICATE OF EXAMINATION
ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Organization
Chapter 2 An Overview of Hypertext and the World Wide Web
2.1 An Introduction to Hypertext
2.1.1 A Brief History of Hypertext
2.1.2 Hypertext Concepts
2.2 The World Wide Web
2.2.1 A Brief History of the World Wide Web
2.2.2 The World Wide Web Concepts
2.3 HyperText Markup Language (HTML)
2.3.1 A Brief History of HTML
2.3.2 Common HTML Tags
2.3.3 Examples
2.4 Searching the World Wide Web
2.4.1 Navigation/Browsing
2.4.2 Searching by Search Engines
Chapter 3 Related Work on Web Querying
3.1 Issues of Querying the Web
3.2 Modeling the World Wide Web
3.2.1 Object Exchange Model (OEM)
3.2.2 Araneus Data Model (ADM)
3.3 Web Querying Systems
3.3.1 WebSQL, W3QS, WebLog
3.3.2 Wrappers Used in Querying the Web
3.3.3 Summary of Web Querying Systems
3.4 A Review of Database Query Languages
Chapter 4 The Prototype Web Querying System
4.1 Data Model
4.2 A Relational View of the World Wide Web
4.3 System Overview
4.4 Mapping HTML Files to the Database
4.4.1 Extracting Information from HTML Files
4.4.1.1 Extracting Title Information
4.4.1.2 Extracting Hyperlinks and Descriptions
4.4.1.3 Extracting Other Information
4.4.2 Storing the Information in the Database
4.5 Query Facilities
4.5.1 Content Query
4.5.2 Structure Query
4.5.3 Advanced Query
4.6 User Interfaces
4.7 Supplementary Functions
4.8 Summary
Chapter 5 Querying the Web Pages
5.1 Query Methods
5.1.1 Content Query
5.1.2 Structure Query
5.1.3 Advanced Query
5.2 Experimental Results
5.2.1 Content Query
5.2.2 Structure Query
5.2.3 Advanced Query
5.3 Summary
Chapter 6 Discussions and Future Work
6.1 Discussions
6.2 Future Work
References
Vita
LIST OF FIGURES
Figure 2.1 A Sample Hypertext Structure
Figure 2.2 Common Tags in HTML
Figure 2.3 A Sample HTML File
Figure 2.4 A Sample Web Page
Figure 3.1 An OEM Graph
Figure 3.2 A Sample ADM Scheme
Figure 3.3 A Sample Web Page
Figure 3.4 The Architecture of the WebSQL System
Figure 3.5 A Sample Architecture with Mediators and Wrappers
Figure 4.1 An Example of Labeled Directed Graph Model
Figure 4.2 A Sample Web Page for Course Description
Figure 4.3 The System Architecture
Figure 4.4 The Query Interface for Content Query
Figure 4.5 Results of a Content Query
Figure 4.6 The Query Interface for Structure Query
Figure 4.7 Results of a Structure Query
Figure 4.8 The Query Interface for Advanced Query
Figure 4.9 Results of an Advanced Query
Figure 4.10 The User Interface for Accessing Query Facilities
Figure 4.11 The User Interface for Supplementary Functions
LIST OF TABLES
Table 4.1 Relation webpage with 1 Tuple
Table 4.2 Relation webpage-d with 1 Tuple
Table 4.3 Relation links with 13 Tuples
Chapter 1
Introduction
1.1 Motivation
Since its creation in 1989, the World Wide Web has been growing at a phenomenal
rate. As a global information resource residing on the Internet, the World Wide Web
contains a large amount of data relevant to almost all domains of human activity:
education, business, entertainment, art, science, politics, religion, etc. There are currently
tens of millions of documents on the Web, and the number is growing. The explosion of
the World Wide Web, on the one side, is providing more and more information; on the
other side, it is making the Web more and more difficult to use. One fundamental problem is
the difficulty of retrieving specific information of interest to the user from the enormous
number of resources that are available [1]. The most common technology used for
searching the Web is based on browsing the web pages by following links or searching by
sending information retrieval requests to "index servers" [2]. While navigation and
browsing are useful, they can lead to the well-known "lost-in-hyperspace" phenomenon.
On the other hand, limitations also exist in using search engines to seek out the
information from the Web. One major limitation of search engines is that they provide only
keyword search, which means this kind of search cannot effectively use the way
information is structured in Web documents [3]. Therefore, complex queries regarding
hypertext structures are not allowed. For example, there is no way to find the
hypertext links of interest within a given web page by using search engines. Under these
circumstances, a flexible and powerful search system for querying the Web is really
needed.
1.2 Objectives
The main objective of this thesis is to design and implement a prototype system
which can be used to query web pages with database query languages. In order to
overcome the major limitation of search engines mentioned in Section 1.1, our
prototype system should have the ability to support structure queries, which are queries
posed on the hypertext structures of the web pages. However, a new issue arises by
adding this facility to our prototype system, i.e., how to represent the structure of the web
pages. Thus, one of the objectives of this thesis is to build up a data model that is
comprehensive enough to capture some important aspects involved in querying the Web.
Different from other web querying systems, our prototype system exploits an existing
database query language to query the web pages. Accordingly, another objective of this
thesis is to explore the benefit that we can get from using the database query language to
query the web pages.
1.3 Thesis Organization
This thesis consists of six chapters.
Chapter 1 provides an introduction to the thesis.
Chapter 2 gives an overview of Hypertext and the World Wide Web. HyperText
Markup Language and the methods currently used to search the web are also introduced.
Chapter 3 introduces the issues regarding querying the Web and presents several
related works on querying the World Wide Web.
Chapter 4 focuses on the design and implementation of the prototype system. In this
chapter, the data model used to model the structure of the web pages is described. A
system overview is provided. The method of mapping from HTML files to the database
and the query facilities developed in this prototype system are also introduced.
Chapter 5 provides and discusses the experimental results obtained from three
different types of queries, i.e., content query, structure query and advanced query.
Chapter 6 concludes this thesis and offers recommendations for future work.
Chapter 2
An Overview of Hypertext and the World Wide Web
The World Wide Web can be considered as a huge hypertext system on the Internet,
where the hypertext nodes are simply HTML files residing on the file systems of certain
Internet hosts. This chapter provides an overview of the hypertext concept and an
introduction to the World Wide Web.
2.1 An Introduction to Hypertext
Hypertext is text with links. It differs from traditional text in providing quick access
to other related documents from the text that is currently being read. Hypertext structure is the
fundamental structure of the World Wide Web, by which Web documents are organized.
2.1.1 A Brief History of Hypertext
Hypertext has a surprisingly rich history compared to the World Wide Web. The first
system we would now describe as a hypertext system was proposed by Vannevar Bush as
early as 1945. This system, the Memex, was never implemented, but was only described in
theory in Bush's paper [23]. It was described as "... a device in which an individual stores
his books, records, and communications, and which is mechanized so that it may be
consulted with exceeding speed and flexibility." The actual word "hypertext" was coined
by Ted Nelson in 1965. Nelson was an early hypertext pioneer with his Xanadu system,
which he has been developing ever since. Parts of Xanadu do work and have been a
product from the Xanadu Operating Company since 1990. The basic Xanadu idea is that
of a repository for everything that anybody has ever written, giving a truly universal
hypertext system. Nelson views hypertext as a literary medium and he believes that
"everything is deeply intertwingled" and therefore has to be on-line together. A final event
was the extremely rapid growth of hypertext on the Internet in the mid-1990s,
spearheaded by the specification of the World Wide Web by Tim Berners-Lee and
colleagues at CERN (the European Center for Nuclear Physics Research in Geneva,
Switzerland). Detailed information about the history of hypertext can be found in [4].
2.1.2 Hypertext Concepts
The simplest way to define hypertext is to compare it with traditional text. All
traditional text is sequential, i.e., there is a single linear sequence defining the order in
which the text is to be read. Generally, when we read a book, we read page one first, and
then page two, and then page three, and so on. Hypertext, however, is non-sequential, that
is, there is no single order that determines the sequence in which the text is to be read.
Usually, hypertext presents several different options for readers to explore rather than a
single stream of information.
Figure 2.1: A Sample Hypertext Structure
Figure 2.1 illustrates a sample hypertext structure. In this figure, A, B, ..., F represent
units of information, which are called nodes; the dots represent anchors, and the arrows represent
links. Each of the nodes may have pointers to other units, and these pointers are called
links. Links provide the mechanism whereby nodes are connected to one another. The
node from which a link originates is called the reference. Points within the reference
where links are defined are referred to as anchors. The node at which a link ends is called
the referent [4]. As can be seen, the entire hypertext structure forms a network of nodes
and links. Readers move about this network in an activity that is often referred to as
browsing or navigating to emphasize that users must actively determine the order in which
they read the nodes. For example, if a reader is currently reading node A, the next node
the reader can choose is B, D or E. If the reader selects B, then the reader has alternatives
to either read all the text in node B or jump to node C or F, and so on.
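The node/link structure described above can be sketched as a small directed graph. The following is only a toy illustration: the node names follow Figure 2.1's A-F, and the outgoing links of nodes C-F are assumed empty since the figure itself is not reproduced here.

```python
# A hypertext as a directed graph: nodes are units of information,
# links are edges from a reference node (at an anchor) to a referent node.
links = {
    "A": ["B", "D", "E"],   # from node A a reader can jump to B, D or E
    "B": ["C", "F"],        # from node B a reader can jump to C or F
    "C": [], "D": [], "E": [], "F": [],  # assumed leaves for this sketch
}

def reachable(start):
    """All nodes a reader can browse to from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(reachable("A")))  # ['B', 'C', 'D', 'E', 'F']
```

Browsing corresponds to one walk through this graph; the `reachable` traversal computes every node any such walk could visit.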
2.2 The World Wide Web
The most widely used system for hypertext is the World Wide Web. It is also one of
the newest Internet services. The World Wide Web has the ability to combine text, audio,
video, graphics, etc. together. Its hypertext structure provides quick access to other
related Web documents. Now, the World Wide Web is emerging as the newest and
most exciting tool for locating and displaying information on the Internet.
2.2.1 A Brief History of the World Wide Web
The history of the World Wide Web is fairly short. It was developed at CERN in the late
1980s. The purpose of the World Wide Web was to allow anyone at CERN to easily
access and display documents that were stored on a server anywhere on the Internet. By
the end of 1990, the researchers at CERN had a text-mode browser and a graphical
browser for the NeXT computer. During 1991, the World Wide Web was released for
general usage at CERN. Initially, access was restricted to hypertext and UseNet news
articles. As the project advanced, interfaces to other Internet services were added, such as
WAIS, anonymous FTP, Telnet, and Gopher. In 1992, the World Wide Web project was
made public. People began to create their own Web servers to make their information
available to the Internet and to design easy-to-use interfaces to the World Wide Web. By
the end of 1993, browsers had been developed for many different computer systems,
including X Windows, Apple Macintosh and PC/Windows. By the summer of 1994, the
World Wide Web had become one of the most popular ways to access Internet resources.
2.2.2 The World Wide Web Concepts
The World Wide Web uses a client-server architecture for distributed hypertext that
can be accessed over the Internet. Servers run specialized software, called HTTPD
(HyperText Transfer Protocol Daemon), which accepts requests that arrive over the
network, performs a function in response to each request, and then returns the results to
the requester. Servers are also regarded as a collection of Web documents including
hypertext files, images, video clips, sound files, etc., which can be shared over the World
Wide Web. An executing program is called a "client" when it is able to send a request to a
server, await a response, and process that response [4]. A Web browser is such a client
that knows how to interpret and display documents that it finds on the World Wide Web;
examples are Netscape and Microsoft Internet Explorer. All the servers provide their data
to the client software in a standardized format called HTML (HyperText Markup
Language) through a standard communication protocol called HTTP (HyperText Transfer
Protocol). This combination of HTML and HTTP constitutes the hypertext abstract
machine and is the only point at which client and server computers need to agree.
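The request/response cycle between an HTTPD and a client can be sketched with Python's standard library. This is a self-contained illustration only (the server, port and document are made up, not any system described in this thesis): a tiny in-process HTTPD serves one HTML page, and a client fetches it over HTTP.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A minimal HTTPD: answers every GET request with a small HTML document.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><head><title>Hello</title></head><body>Hi</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: send a request, await the response, process it.
url = f"http://127.0.0.1:{server.server_port}/index.html"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode()

print(html)
server.shutdown()
```

The only thing the two sides agree on is exactly what the text above says: the HTTP protocol carrying an HTML document.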
The World Wide Web has a standard way of referencing a document by using a
Uniform Resource Locator (URL), no matter what the document's type is, for example,
text, sound file, etc. A URL is a complete description of a document, containing the
location of the document you want to retrieve. The location could be on your local disk or
on an Internet site halfway around the world. A URL can be set up to be absolute or
relative. An absolute URL contains the complete address of the document that is being
referenced, including the host name, directory path, and file name. The formal syntax of an
absolute URL is:
<Protocol>://<Host>/<Path>/<Filename>#<Location>
where <Protocol> is a protocol that the Web browser can use to retrieve documents,
such as http, ftp, gopher, news, mail, etc.; <Host> is the server name; <Path> is a Unix-
style path for the file; <Filename> is the actual file name and <Location> is a textual
label in the file. For example, an absolute URL breaks down into the parts:
Protocol   Host   Path   File Name   Location
However, if the destination document is on the same Web server as the source document,
a relative URL may be used. A relative URL omits the protocol and host, or even the
path; that is, a relative URL only specifies the subdirectory if available and the file name.
The following example illustrates a relative URL that can be found in our example HTML
file in Figure 2.3.
Subdirectory   File Name
What is worth mentioning is that the equivalent absolute URL can always be constructed
from the URL of the current document and the relative URL in the current document.
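The construction of an absolute URL from the current document's URL and a relative URL can be sketched with Python's standard `urllib.parse` module. The base URL here is hypothetical, chosen only to mirror the document-plus-subdirectory situation described above:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical current document and a relative URL found inside it.
base = "http://www.example.edu/gradstudents/students/xyang/index.html"
relative = "images/Middlesex.html"   # subdirectory + file name only

# The relative URL is resolved against the base document's directory.
absolute = urljoin(base, relative)
print(absolute)
# http://www.example.edu/gradstudents/students/xyang/images/Middlesex.html

# The absolute URL decomposes into the parts named in the syntax above.
parts = urlparse(absolute)
print(parts.scheme)   # Protocol: http
print(parts.netloc)   # Host
print(parts.path)     # Path + Filename
```

Note how `urljoin` drops the base's file name (`index.html`) and keeps its directory, which is exactly the construction the text describes.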
2.3 HyperText Markup Language (HTML)
HTML is the language used when writing a document that is to be displayed through
the World Wide Web. It will be apparent from the example in Figure 2.3 that HTML is a
fairly simple markup language that describes how a document is structured. It is therefore
easy for people to write HTML files for distribution over the World Wide Web, and this
simplicity has been one of the factors in the success and growth of the World Wide Web.
2.3.1 A Brief History of HTML
HTML was originally developed by Tim Berners-Lee while at CERN, and
popularized by the Mosaic browser developed at the National Center for Supercomputing
Applications (NCSA). During the course of the 1990s it has blossomed with the explosive
growth of the World Wide Web. In 1994, HTML 2.0 was developed to codify common
practice. HTML+ (1993) and HTML 3.0 (1995) proposed much richer versions of
HTML. In 1996, the efforts of the World Wide Web Consortium's HTML Working Group
to codify common practice resulted in HTML 3.2. Now, HTML 4.0 is the latest version
with more powerful and mature features.
2.3.2 Common HTML Tags
HTML is an application of SGML (Standard Generalized Markup Language) [4]. It
defines a collection of tags that can be used to publish on-line documents with headings,
text, tables, lists, photos, etc. and to retrieve on-line information via hypertext links.
These tags also provide a means to enable images, sound and even animation to be
embedded in Web documents, and to design forms for conducting transactions with remote
services, for use in searching for information, making reservations, ordering products, etc.
Figure 2.2 contains some of the commonly used tags in an HTML document.
The most important feature of HTML is its ability to insert hypertext links into an
HTML document so that other HTML documents can be linked by these links. Hypertext
links in an HTML file are pointers from keywords appearing in the document to a
destination. The destination could be another HTML document or a resource such as an
external image, a video clip, or a sound file. HTML supports hypertext links through an
anchor tag <A> in the form of:
<A HREF="DestURL">AnchorText</A>
where HREF stands for Hypertext REFerence; DestURL is the URL of the
destination document and AnchorText is the text to appear as an anchor when the
document in which this hypertext link is defined is displayed by the Web browser.
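As a rough sketch of how such (DestURL, AnchorText) pairs can be pulled out of an HTML file, the following uses Python's standard `HTMLParser`. This is an illustration only, not the parser implemented in this thesis; the input HTML fragment is made up to resemble Figure 2.3:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (DestURL, AnchorText) pairs from <A HREF=...> tags."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":       # HTMLParser lowercases tag and attribute names
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<A HREF="personal.html">Personal Information</A> <A HREF="courses.html">Courses</A>'
p = AnchorExtractor()
p.feed(html)
print(p.links)
# [('personal.html', 'Personal Information'), ('courses.html', 'Courses')]
```

A parser of this general shape is what makes the hyperlink extraction of Chapter 4 possible, although the thesis's own implementation may differ.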
2.3.3 Examples
Basically, an HTML document consists of two parts: the head and the body. The head
contains meta-information about the document. It is specified using the <title> ...
</title> or <meta ...> tags. The body contains the displayable content. Users can navigate
over the various documents by activating the hypertext links of interest. The display that is
the result of viewing an HTML document using a browser is called a page. Figures 2.3 and
2.4 show a sample HTML file and the corresponding file displayed in a Web browser.
Page Markup:
  <html>...</html>      HTML document definition
  <head>...</head>      Head section definition
  <body>...</body>      Body section definition
  <title>...</title>    Document title
Hypertext Links:
  <a>...</a>            Hyperlink (anchor) definition
Inline Images:
  <img>                 Image
Form Elements:
  <form>...</form>      Fill-out forms
  <input>               Input element
Table Elements:
  <table>...</table>    Table definition
  <tr>...</tr>          Table row
  <td>...</td>          Standard table data cell
  <th>...</th>          Table header cell
List Elements:
  <ul>...</ul>          Unordered list
  <li>                  List item
  <menu>...</menu>      Menu list
Structural Markup:
  <p>                   Paragraph break
  <br>                  Line break
  <h1>...</h6>          Level 1-6 headings
Style Markup:
  <i>...</i>            Italics style
  <strong>...</strong>  Strong emphasis style
Figure 2.2: Common Tags in HTML [4]
<HTML>
<HEAD>
<TITLE>Xiaoyu Yang's Home Page</TITLE>
</HEAD>
<BODY BGCOLOR="...">
<CENTER><IMG SRC="welcomeee.gif" ALIGN=middle><BR><BR><BR>
<FONT SIZE=+6 ALIGN=middle><I><BLINK>Welcome to Xiaoyu Yang's homepage</BLINK></I></FONT><BR><BR>
<IMG SRC="grapevine.gif" ALIGN=middle><BR><BR><BR>
<TABLE BORDER="0" CELLPADDING="10">
<TR VALIGN="middle"><TD>
<A HREF="personal.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Personal Information</STRONG></A><BR>
<A HREF="research.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Research Work</STRONG></A><BR>
<A HREF="/faculty/sylvia/"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>My Supervisor</STRONG></A><BR>
<A HREF="courses.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Courses</STRONG></A><BR>
<A HREF="ta.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>TA</STRONG></A><BR>
<A HREF="interests.html"><IMG SRC="1384.gif" HSPACE=15 BORDER=0>
<STRONG>Interesting Links</STRONG></A>
</TD></TR></TABLE><BR>
<IMG SRC="grapevine.gif" ALIGN=middle><BR><BR><BR><P>
<A HREF="http://www.csd.uwo.ca/">Department of Computer Science</A> |
<A HREF="images/Middlesex.html">Middlesex College</A><BR>
<A HREF="http://www.uwo.ca/">The University of Western Ontario</A> | <A HREF="http://www.city.london.on.ca/">London</A> | <A HREF="http://www.gov.on.ca/">Ontario</A> | <A HREF="http://canada.gc.ca/">Canada</A><BR><BR><BR>
This page has been accessed
<IMG SRC="http://.../cgi-bin/nph-count?width=6&link=http://www.csd.uwo.ca/gradstudents/students/xyang/index.html"> times since July 1, 1998.</P>
</CENTER><HR><BR><BR>
<IMG SRC="painting.gif"> <FONT SIZE=1><B>Last modified by
<A HREF="mailto:xyang@csd.uwo.ca">Xiaoyu Yang</A> on February 18, 1998.</B></FONT>
</BODY>
</HTML>
Figure 2.3: A Sample HTML File
Welcome to Xiaoyu Yang's Homepage
Personal Information   Research Work   My Supervisor   Courses   TA   Interesting Links
Figure 2.4: A Sample Web Page
2.4 Searching the World Wide Web
One of the real advantages of the World Wide Web system is that ordinary users can
create web pages that users anywhere on the Internet can display. This feature allows
ordinary users to publish information that can be used by the entire world and also results
in the rapid growth of the amount of information available on the World Wide Web. There
are software programs, graphics, magazine articles, job postings, government reports,
weather maps, and thousands and thousands of documents, and so on. Therefore, it is not
always easy for users to find what they want, or even to know how to find it. Currently,
there are two main methods that are used to search for information of interest:
navigation/browsing and searching by search engines.
2.4.1 Navigation/Browsing
This is an excellent method of locating information which a user may not have
considered available on the World Wide Web. It involves starting somewhere and just
following the links. This is the simplest way to find information, but it is not a reliable method
to find a particular piece of information on the World Wide Web. As mentioned in Chapter
1, readers often experience the "lost-in-hyperspace" phenomenon when navigating the
World Wide Web.
A solution to the navigation problem is to provide users with classified directories
[14], which can guide users to useful resources on a particular subject or of a particular
type. An excellent example of this is Yahoo, which allows the users to search through its
hierarchy. Since documents of a similar subject/type may be grouped together, this method
obviously can narrow the scope of a search. Navigation by classified directories, however,
is sometimes limited by the documents available, and there is still a risk that users may
become disoriented or have trouble finding the information they need.
2.4.2 Searching by Search Engines
The need for retrieving information from the World Wide Web has led to the
development of a number of search engines. They search the Web according to the
keywords or phrases specified by the user and return the results which are related to the
keywords or phrases. Typically, search engines are composed of a resource locator (also
known as a robot) and a search interface. Searches are based on an index database which
stores the information of the web pages. The resource locator is run periodically to gather
information from the Web and to create and update the index database. The search interface
takes a user query and passes the request to the Web server, which performs a retrieval from
the index database and returns the results. The results appear in hypertext and can
immediately be selected to link to the required documents.
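The index-database idea can be sketched as a toy inverted index. This is an illustration of the general mechanism only, not any particular engine's implementation; the pages and URLs are made up:

```python
from collections import defaultdict

# Hypothetical pages gathered by a resource locator (robot).
pages = {
    "http://a.example/": "database query languages for the web",
    "http://b.example/": "hypertext structure of web pages",
    "http://c.example/": "relational database systems",
}

# The "index database": maps each keyword to the URLs of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(keyword):
    """The search interface: look the keyword up in the index database."""
    return sorted(index.get(keyword, set()))

print(search("database"))   # pages a and c
print(search("hypertext"))  # page b
```

Note that the index knows only which words occur where; it records nothing about how pages link to one another, which is exactly the limitation discussed next.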
Many external search engines with different capabilities exist on different servers.
Some of the search engines are: AltaVista, InfoSeek, Lycos, etc. Although there are a
large number of search engines available on the Web, they are all used in exactly the same
simple way: type in some text, and get back a hypertext answer which points to things that
were found by the search. Searching the World Wide Web by search engines is easy and
useful. However, as Martijn Koster says in his article [6], robots "will become less
effective and more problematic as the Web grows". The major limitation of search engines
is that they support only keyword search, which means it is impossible to pose queries on
hypertext structures. For example, assume we know the URL of the home page of the
Computer Science Department at the University of Western Ontario and we would like to
be able to restrict the search to only pages directly or indirectly reachable from this page.
With currently available search tools, this kind of query is not possible. There are also
some other drawbacks with search engines, such as: they cannot be adapted easily to the
requirements of a specific user, their query language is poor, and they often return too
many answers, badly ordered and mainly irrelevant. These limitations stimulate the need
for new and powerful search tools.
Chapter 3
Related Work on Web Querying
The World Wide Web is a distributed, ever growing, global information resource. The
rapid growth of the Web makes the wealth of information become more and more difficult
to mine. Currently, there are only two ways to search the information available on the
Web: navigation/browsing or searching by search engines. These two methods, however,
have important limitations, as stated in Chapter 1. Thus, the situation here is that we have an
invaluable information resource, but cannot use it effectively. This compelling need for
querying the Web in a flexible and powerful way has led to the development of a number
of new web querying languages and systems. In this chapter, we focus on the related work
in the area of querying the Web. First, the important issues and the difficulties existing in
querying the Web are presented. Then, data models used to describe the Web are
introduced. Finally, several on-going Web querying projects are discussed.
3.1 Issues of Querying the Web
One important and fundamental issue of querying the Web is the design of a data
model, which should be comprehensive enough to capture most of the important aspects
involved in querying the Web. On the Web, data consists of files in a particular format,
HTML, with some structuring primitives such as tags and anchors. Generally, the
structure of HTML files is irregular, implicit, partial and frequently changing. These files
do have some structure but it is too irregular to be easily modeled by using a relational or
an object-oriented approach [19], especially when the structure is nested or cyclic.
Accordingly, these kinds of files are called semi-structured files [18]. How to model semi-
structured files is thus an essential issue in querying the Web.
Another important issue in the area of querying the Web is extracting information
from the Web. The irregularity of the structure of web pages results in the difficulty of
extracting the information. This problem has been studied and partially solved for SGML
documents [7, 8]. The idea used here is to map the underlying grammar of the document
to an appropriate database schema. Thus, when the document is parsed by using this
grammar, corresponding objects would be created in the database. However, when dealing
with HTML files in the same way, grammars show important limitations. First, the
structure of HTML files is not always completely defined. Second, the structure can be
irregular, and HTML files often contain errors, in the sense that they do not fully comply
with HTML grammar rules; missing tags are a common example of these errors.
Moreover, information gathering on the Web lays its emphasis on navigation via
hyperlinks that relate documents to one another. Under these circumstances, the design of
a parser or other tools used to extract information from the Web becomes more difficult.
A query language is also a very important issue of web querying and an absolutely
necessary component of web querying systems. Basically, this kind of language should
have the power of traditional query languages and also support richer data types, allowing
recursive queries, etc. Recently, query languages for the Web have attracted a lot of
attention [10]. Several SQL-like query languages have been designed for the Web,
although these languages still need a sound theoretical foundation. Another trend of web
querying is to exploit existing database query languages. These languages are based on
well-defined theory and are fairly mature, and thus should be able to provide more powerful
query facilities.
3.2 Modeling the World Wide Web
As mentioned above, web modeling is a very important issue of web querying.
Several data models have been proposed to describe the Web. In this section, two
different data models, namely, OEM (Object Exchange Model) [9, 20, 22] and ADM
(Araneus Data Model) [11], are introduced.
3.2.1 Object Exchange Model (OEM)
The Object Exchange Model (OEM) is proposed to represent semi-structured data. It is a
simple, self-describing model with object nesting and identity. Data represented in OEM
can be thought of as a graph, with objects as the vertices and labels on the edges. Entities
are represented by objects. Each object has a unique object identifier (oid), a label and a
value. The label is a string denoting the "meaning" of the object. The value can be from
one of the disjoint basic atomic types, such as integer, real, string, gif, html, audio, etc.
The value can also be a complex object, which is a set of sub-objects. An object is thus a 3-
tuple: <oid, label, value>. A database D = <O, N> is a set O of objects, a subset N of
which are named objects. The intuition is that named objects provide "entry points" into
the database from which sub-objects can be requested and explored. An OEM database
can also be easily viewed as a relational database with a binary relation VAL(oid, value)
to specify the values of atomic objects and a ternary relation
MEMBER(oid1, label, oid2) to specify the values of complex objects. As can be seen, the
design of OEM is intended to make it a simple, flexible and powerful data model for
describing semi-structured data.
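As a concrete illustration, the flattening of a small OEM database into the VAL and MEMBER relations can be sketched as follows. The object ids &19 and &17 and the labels "category" and "gourmet" follow the figure discussed below; the "name" sub-object and all other concrete values are invented for the example.

```python
# Flattening a tiny OEM database into the relational view: VAL(oid,
# value) for atomic objects, MEMBER(oid1, label, oid2) for complex
# objects. Ids &19 and &17 follow Figure 3.1; the "name" sub-object
# and the concrete values are invented.
objects = {
    "&19": ("restaurant", [("category", "&17"), ("name", "&18")]),
    "&17": ("category", "gourmet"),
    "&18": ("name", "Chef Chu"),
}

VAL = []     # (oid, value) tuples for atomic objects
MEMBER = []  # (oid1, label, oid2) tuples for complex objects

for oid, (label, value) in objects.items():
    if isinstance(value, list):            # complex object: a set of sub-objects
        for sub_label, sub_oid in value:
            MEMBER.append((oid, sub_label, sub_oid))
    else:                                  # atomic object
        VAL.append((oid, value))
```

The same dictionary of 3-tuples thus yields both relations, which is what makes the relational view of an OEM database straightforward.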
There are some minor variations on the OEM graph model that are used in querying
the Web [21]. These data models use a labeled graph or a graph schema, with nodes
representing web pages and labels representing hyperlinks.
Figure 3.1 illustrates an OEM graph. In this figure, &19 is the identifier of an object
whose complex value is a set containing, among others, the pair ("category", &17). &17 is
the identifier of an atomic object whose value is "gourmet".
Figure 3.1: An OEM Graph [10]
3.2.2 Araneus Data Model (ADM)
The Araneus Data Model (ADM) is a page-oriented data model for the Web. This
means the main construct of the model is that of a page scheme. Each page scheme
describes the structure of a set of homogeneous pages. Each web page is thus considered
as an object with an identifier (the URL) and a set of attributes, one for each relevant
piece of information in the page. The attributes used to describe a web page can be either
simple, like text, image, or a link to another page, or complex. Complex attributes are
essentially lists of items, possibly nested. From this perspective, an ADM scheme can
be seen as a collection of page schemes, connected using links. Figure 3.2 shows an ADM
scheme with one of the example page schemes corresponding to the web page shown in
Figure 3.3.
Figure 3.2: A Sample ADM Scheme [11]
Leonardo Da Vinci
Figure 3.3: A Sample Web Page [11]
Figure 3.3 shows a sample web page containing the publications by an author. Its
corresponding page scheme AuthorPage can be found in Figure 3.2, which shows an
ADM scheme for the DB&LP Bibliography home page. For each author in the DB&LP
Bibliography home page there is a similar web page, and all of these web pages share the
same structure. Therefore, they can be described by the same page scheme. The
AuthorPage scheme has two attributes: Name and WorkList, the latter being a list of
publications, i.e., a set of nested tuples. For each paper listed in the WorkList, there is a
page scheme, ConferencePage or JournalPage, and these two page schemes in turn contain
their own attributes, and so on. As can be seen, an ADM scheme deals with structured web
pages in which data are organized according to precise structures and web pages present
strong regularities. Therefore, it is suitable for building database abstractions of large and
fairly well-structured web sites.
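An ADM page instance for the AuthorPage scheme can be sketched as follows. The attribute names Name and WorkList come from the text above; the field names inside the nested tuples and all concrete values are invented for the example.

```python
# An ADM-style instance of the AuthorPage scheme of Figure 3.2. Name
# is a simple attribute; WorkList is a complex attribute, a possibly
# nested list of tuples whose field names here are illustrative only.
author_page = {
    "url": "http://example.org/authors/da-vinci.html",  # page identifier
    "Name": "Leonardo Da Vinci",                        # simple attribute
    "WorkList": [                                       # complex attribute
        {"Title": "Paper A", "RefPage": "http://example.org/conf/a.html"},
        {"Title": "Paper B", "RefPage": "http://example.org/journal/b.html"},
    ],
}
```

Every author page of the site would instantiate this same scheme, differing only in its URL and attribute values.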
3.3 Web Querying Systems
Several web querying languages and systems have been proposed recently. Most of
the efforts are concerned with issues such as the development of data models and query
languages for the Web, defining formal semantics for the proposed languages, and
implementation issues. In some of these systems, such as WebSQL [3, 17], W3QL [13]
and WebLog [1], there is a very simple notion of scheme, and web pages are considered
within a single type, i.e., as nodes in a graph, with at most a fixed set of attributes. In other
words, these kinds of systems use OEM or variants of OEM to model the Web. There is
also a kind of web querying system which intends to exploit the regular structure
presented in web pages and therefore uses a more complex data model. For example, the
Araneus Project [11] uses a page-oriented data model to describe the Web. Another trend
in retrieving information from the Web is to focus on the generation of wrappers, which can
facilitate database-like querying of semi-structured data retrieved directly from Web
servers. This kind of web querying system is called a mediator-based system. Related
work in wrapper generation and mediator-based systems can be found in [12].
In this section, we first introduce WebSQL, a web querying system based on a simple
graph data model. The other two systems similar to WebSQL, i.e., W3QS and WebLog,
are also introduced, but only their differences are emphasized. The mediator-based systems
present a different system architecture because of the existence of the wrapper and
mediator. We illustrate in this section the architecture of the wrapper and mediator and
also introduce the generation of a wrapper.
3.3.1 WebSQL, W3QS, WebLog
WebSQL was developed at the University of Toronto. Its query language is an SQL-like
language for querying Web sources by exploiting the structure and topology of the
document networks. The distinct feature of WebSQL is that it provides a formal semantics
and emphasizes the distinction between local and remote documents. Figure 3.4 provides a
system overview of WebSQL.
Figure 3.4: The Architecture of the WebSQL System
In the WebSQL system, the User Interface accepts the user query and passes the
query to the WebSQL Compiler, where the user query is parsed and translated into a
custom-designed object language. When the Virtual Machine receives the object code
generated for the query, it executes the object code and sends the requests to the
Query Engine, which finally performs the query, extracts the information of interest from
the Web and returns the results. After the Query Engine passes the
results to the Virtual Machine, the Virtual Machine turns the results into HTML form
and then displays them to the user [3].
In WebSQL, the hypertext structure is represented by a graph data model [17]. This
model can then be viewed as a relational model composed of two virtual relations: one for
web documents and the other for anchors in web documents. The relational abstraction of
the Web allows one to use an SQL-like query language to pose queries on both content
and hypertext structure. Although the WebSQL query language is designed as a subset of
SQL, it is only a simulation of SQL. Therefore, it cannot be as powerful as SQL, and a lot of
work remains to be done in designing the query language, such as query optimization,
which has actually been well studied in existing database systems.
W3QS, developed at the Technion, Israel, is a system for SQL-like querying of the
Web. The system architecture is slightly different from that of WebSQL. The feature of
W3QS is that it interfaces to user programs and UNIX services for analyzing and filtering
semi-structured information from Web servers. It allows the use of Perl regular
expressions and calls to UNIX programs from the "where" clause of an SQL-like query, and
even calls to Web browsers. Moreover, the language has been designed to be highly
extensible, and tools for managing Web forms encountered during navigation are
presented [13]. Again, advanced database techniques are not exploited in W3QS either.
Different from the two web querying systems mentioned above, WebLog, developed
at Concordia University, Montreal, and the University of North Carolina, emphasizes
manipulating the internal structure of Web documents. Its query language is based on
Datalog-like recursive rules [1].
3.3.2 Wrappers Used in Querying the Web
In a mediator-based system, wrappers are the essential components built around
individual information sources. They are used to accept queries from the mediator,
translate each query into the appropriate query for the individual source, and return the
results to the mediator. They make the Web sources look like databases that can be
queried through the mediator's query language, i.e., a database query language or a
custom-designed query language. Figure 3.5 shows an example of mediator architecture.
In this figure, the sources represent several related Web sources in a particular domain of
interest. All of them should conform to the same format. The mediator here is used to
integrate information from multiple Web sources, i.e., Sources 1, 2, 3, and it is built for a
particular domain of interest.
Figure 3.5: A Sample Architecture of Mediator and Wrappers
When a wrapper is generated for a new Web source, the following steps are involved.
First, the web pages need to be structured, i.e., sections and sub-sections of
interest on a page must be identified. Then, a parser should be built for the source pages to
extract the sections of interest. Finally, communication capabilities between the wrapper,
mediator and Web sources should be added so that wrappers can fetch the pages containing the
requested information from the Web source and return them to the mediator. The key idea
of generating a wrapper is to exploit formatting information in web pages to hypothesize
the underlying structure of a page. Once the correct structure is obtained, a wrapper for
the source can be generated without much effort or time and information of interest can be
obtained. When web pages are loosely structured, such as personal home pages, building a
wrapper becomes a difficult task [12].
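The three steps above can be sketched as a minimal wrapper skeleton. The fetch and parse callables stand in for the communication layer (step 3) and the source-specific parser (steps 1 and 2); the field/value query interface and the dictionary record format are illustrative assumptions, not the interface of any particular mediator system.

```python
# A minimal wrapper skeleton: accept a request from the mediator,
# fetch the source page, parse out the structured records, and return
# the matching ones. All names and formats here are illustrative.
class Wrapper:
    def __init__(self, fetch, parse):
        self.fetch = fetch    # url -> raw page text (communication layer)
        self.parse = parse    # raw page text -> list of structured records

    def query(self, url, field, value):
        """Answer a mediator request by fetching, parsing and filtering."""
        records = self.parse(self.fetch(url))
        return [r for r in records if r.get(field) == value]
```

A mediator would hold one such wrapper per source and merge the answers it gets back.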
3.3.3 Summary of Web Querying Systems
All the current systems used for querying the Web provide a query language. These
languages are either SQL-like or Datalog-like and allow for expressing both structure-
specifying queries, based on the organization of the hypertext, and content queries, mainly
based on information retrieval techniques, e.g., search engines. In mediator-based systems,
a database query language can be used as the mediator's query language, as can other SQL-
like query languages. This kind of system is, however, suitable only in cases where web
pages present strong structure. For those web pages that are loosely structured, most of
the systems exploit an SQL-like or Datalog-like query language to query them. None of
them exploits a database query language, and thus they cannot benefit from advanced
database technologies. Having this in mind, we are trying to develop a system that uses an
existing database query language to query the Web and to show the power of a database
query language in querying the Web.
3.4 A Review of Database Query Languages
Database systems have been in existence for more than 30 years and have been
successfully used in a wide range of application areas, such as business, industry,
scientific research, engineering, and most recently the World Wide Web.
One of the major purposes of database systems is to store data while providing ad hoc
query facilities to query these data. To accomplish this purpose, Dr. E. F. Codd proposed
the relational data model, based on strong mathematical foundations, in 1970. During the
1970s, research and development work on relational database systems was carried out and
several prototypes were developed. The SQL-based database systems of the 1980s
provided, for the first time, a single language to span the whole range of applications, with
support for multiple views of data and independence from physical data structures. Since
that time, relational databases have grown from strength to strength. Because of the
success of these systems, relational databases have become ubiquitous and SQL has
become a world-standard database language. It was, however, soon discovered that
traditional database query languages have limited expressive power. For
example, they support only limited data types and cannot compute arbitrary transitive
closures. Some of these challenges faced by relational systems derive from the need to
store and retrieve new types of very large objects with complex state and behavior, such
as multimedia objects and data from the Web. In the mid-1980s, a new type of database
system emerged to meet these challenges: object-oriented database systems.
These systems address many of the weaknesses of relational databases by providing
object-oriented features and supporting richer data types. Recursive traversal of object sets
is also possible in these systems. Meanwhile, extensions to the relational model have been
defined more recently by introducing the concepts of the Abstract Data Type (ADT) and nested
relations to the relational model to improve its object-orientation, leading to so-called
object-relational database systems. Relational query languages have been extended
and new features have been added. For example, SQL3 (Structured Query Language) is
an effort to turn ANSI SQL-92 into an object-relational query language. Compared with
SQL-92, the new features of SQL3 include not only further developments and extensions
of existing concepts, but also some completely new concepts. One extension in the query
facility is to extend query possibilities, for example, by using recursion. New features of
SQL3 include support for ADTs, nested relations, etc. One example of such systems is
DB2, which is a substantial advance over traditional relational systems. The new features
of DB2 include major innovations in query optimization, recursive union, active databases
(triggers), and stored procedures [24]. It integrates object-oriented ideas with the SQL
language to produce an object-relational database management system and provides new
functions and data types, including data types for storing large objects. More importantly,
it provides a means for users to define additional functions and data types of their own to
meet the specialized needs of their applications.
Database technology has by now become fairly mature. It is well known
that database systems offer efficient and reliable technology for querying structured data. It is,
however, a new and challenging issue to apply database techniques to the poorly structured
World Wide Web. Once web pages can be described by a database schema, the
management of web data should be able to profit highly from database technology. Our
attempt to build the prototype system on top of a database system and then use the
database query language to query the web pages is inspired by this idea. The database
management system we use for our web querying system is DB2 Version 2 for common
servers. As can be seen in the later chapters, we benefit a lot from the powerful query
capabilities of DB2, and by using an existing database query language we are freed from
the design and implementation of a new query language for the Web.
Chapter 4
The Prototype Web Querying System
Our prototype system is designed and implemented for querying the Web by using
database query languages. The system provides three kinds of queries: content queries,
which are queries posed on the content of web pages; structure queries, which are queries
posed on the underlying hypertext structure of web pages; and advanced queries, which
are arbitrary queries posed on the virtual relations of web pages. The structure of web
pages is modeled in our prototype system by a simple labeled directed graph. This chapter
describes the low-level design of the prototype system, including the underlying data
model, the virtual relations, the parser used to map HTML files to the database, and the
search facilities developed in the prototype.
4.1 Data Model
The World Wide Web is a large, heterogeneous, distributed collection of documents
connected by hypertext links. At the highest level of abstraction, it can be viewed as a
graph whose nodes are web pages that are identified by URLs and have some arbitrary
attributes. In our prototype system, the World Wide Web is modeled as such a simple
labeled directed graph. This model can be viewed as a variant of OEM.
In our data model, each web page is represented as a node in the graph. Each node
has a unique identifier, a label and a value. The identifier of each node is the URL of the
corresponding web page. The label is a string, namely the AnchorText [see Section 2.3.2]
that describes the hyperlink. The value is a set of attributes describing the node.
Labels are also attached to edges. If a node contains hyperlinks, it must have outgoing
edges to other nodes. If a node has no hyperlinks, it does not have any outgoing edge
and therefore is a leaf node. For any two nodes x, y, there can be at most two edges, with
different directions, between x and y. A node can have at most one edge that points to the
node itself. As can be seen, the data model is a simplified one, since it allows for only one
link in any one direction between two pages. Figure 4.1 illustrates the labeled directed
graph modeling part of the web pages of the Department of Computer Science at the
University of Western Ontario.
Figure 4.1: An Example of the Labeled Directed Graph Model
Figure 4.1 illustrates the structure of 9 web pages found in the Department of
Computer Science web site on August 2, 1998. These web pages are represented by
nodes, which are circles filled with light gray. &1, &2, ..., &9 represent the URLs of the
corresponding web pages, i.e., nodes 1 to 9. Labels are descriptions of links between
nodes. A possible edge for a node could be an incoming edge, an outgoing edge or a loop.
An incoming edge of a node is a link that points to this node. For example, node 2 has an
incoming edge from node 1 labeled "About the Department". An outgoing edge is a link
pointing to another node. For example, node 1 has an outgoing edge to node 2 labeled
"About the Department". A node has a loop when it has a link that points to the node itself.
For example, node 9 has a loop labeled "Return to Top".
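A minimal sketch of this data model, using the two edges just described from Figure 4.1 (node attribute values are omitted here; they are the subject of the next section):

```python
# A sketch of the labeled directed graph model. Nodes are web pages
# identified by URLs (&1 ... &9 in Figure 4.1); each directed edge
# carries the anchor-text label of the hyperlink. Since the model
# allows at most one edge per direction between two nodes, a dict
# keyed by (source, destination) suffices.
edges = {}

def add_link(src, dst, label):
    """Record a labeled edge; src == dst represents a loop."""
    edges[(src, dst)] = label

# Two edges taken from Figure 4.1:
add_link("&1", "&2", "About the Department")   # outgoing edge of node 1
add_link("&9", "&9", "Return to Top")          # loop on node 9

def is_leaf(url):
    """A node with no hyperlinks has no outgoing edges and is a leaf."""
    return not any(src == url for (src, _dst) in edges)
```

Incoming edges of a node are simply the dictionary entries whose destination is that node's URL.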
4.2 A Relational View of the World Wide Web
After the structure of web pages is represented by a labeled directed graph model, we
can easily view the Web as a relational database. The only difficulty here is to define the
value, that is, the set of attributes describing the nodes. The set of attributes could be very
complex, reflecting the internal structure of a web page. Each attribute could be related to
a small piece of information presented in the web page. For example, a number of web
pages that provide information for Course Descriptions can be found on the departmental
web site; each of these pages presents the same structure as that shown in Figure 4.2.
Course Description
Computer Science 411a/b  Databases II
A selection from the following topics: dependency theory; object-oriented databases; distributed databases and related algorithms; database hardware; information retrieval.
Antirequisite: The former Computer Science 410a
Prerequisite: Computer Science 319a/b.
3 lecture hours, half course.
Figure 4.2: A Sample Web Page for Course Description
For these web pages, we can use a set of attributes, such as (Description, Antirequisite,
Prerequisite, Load), to define them. In this way, web pages can be described precisely and
there is less chance of losing information in web pages when they are mapped to the
database. Most web pages, however, are loosely structured; personal home pages, for
example, almost all have their own unique styles. Defining attributes as we do in
the above example is almost impossible for them. As a tradeoff, we take a minimalist approach to
determining the attributes, which captures only common features of web pages.
Generally, in an HTML file corresponding to a web page, there is always a pair of
title tags, i.e., <title> Title </title>, which provides the title information of that page. This
information can be used as an attribute describing the web page. Also, there is some other
general information that can be found in a web page, such as the name of the author, the
number of links contained in the web page, the last modified date, and the size of the
corresponding HTML file. Hence, a set of attributes used to describe an arbitrary web
page can be obtained, e.g., (title, author, linkno, last_modified, size). Once we have
assigned a value to a node, i.e., a web page, we can associate the node with a tuple in a
webpage relation:
webpage (url, title, author, linkno, last_modified, size)
Here, url represents the URL of the web page and is thus the primary key. Except for
linkno and size, which are integers, all other attributes are character strings. Except for the
primary key, all other attributes may be null. As can be seen, this relation provides
general information for web pages. It gives a web page a highly abstract description.
However, when web pages are mapped into this relation, some of the information may be
lost. As a result, content queries cannot be executed precisely. In order to overcome this
limitation, we need a supplementary relation for the node. This relation is defined as
follows in our prototype:
webpage_d (url, content)
Each node in the data model is related to a tuple in this relation, and the primary key url is
the URL of the node. The attribute content is associated with the whole HTML file of a node.
Here, we take advantage of the new data type CLOB (Character Large Object)
provided in DB2. This data type can contain up to two gigabytes (2^31 - 1 bytes) of single-
byte character data, so it has the ability to hold a whole HTML file. By using this relation,
information in a web page will not be lost when the web page is mapped to the database.
Hence, content queries can be executed precisely. The reason why we use two relations to
describe a web page is that relation webpage_d is used for content queries only and
relation webpage is used for constructing query results.
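The two-relation design can be illustrated with the following sketch, which creates the relations and runs a content lookup. SQLite is used here purely as a stand-in for DB2 (its TEXT column plays the role of DB2's CLOB), and the sample tuple is invented.

```python
import sqlite3

# The two relations of the relational view, using SQLite in place of
# DB2. A content query joins webpage_d (for the LIKE predicate on the
# stored HTML) with webpage (for the result attributes).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table webpage (
        url           text primary key,  -- URL of the page
        title         text,
        author        text,
        linkno        integer,           -- number of hyperlinks
        last_modified text,
        size          integer            -- size of the HTML file
    );
    create table webpage_d (
        url     text primary key,
        content text                     -- whole HTML file (CLOB in DB2)
    );
""")
conn.execute("insert into webpage values (?,?,?,?,?,?)",
             ("http://example.org/", "example page", None, 3, "1998-08-02", 1024))
conn.execute("insert into webpage_d values (?,?)",
             ("http://example.org/", "<html><title>example page</title></html>"))
rows = conn.execute(
    "select w.url, w.title from webpage w, webpage_d d "
    "where w.url = d.url and d.content like '%example%'").fetchall()
```

Note that the LIKE predicate touches only webpage_d, while the displayed columns come from webpage, mirroring the division of labor described above.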
One motivation for developing new web querying systems is that current search
engines cannot use the structure of Web documents. To address this problem, new web
querying systems should provide the function of querying the hypertext structure. To
implement this function, there should be a relation in which the relationship between two
nodes is represented. In our data model, we capture the information present in a hyperlink
as a tuple in a links relation:
links (url_a, url_b, description)
where url_a and url_b are the URLs of the origin and destination of the link, i.e., url_a and
url_b correspond to nodes A and B with the relationship A → B. All these
attributes are character strings. The primary key for this relation is the combination of
url_a and url_b. Only description may be null.
Based on the labeled directed graph model, the three relations introduced above
model the Web as a relational database. They capture both the information and the
hypertext structure presented in web pages. This relational abstraction of the Web allows
us to use a database query language to pose queries on both content and structure.
4.3 System Overview
Conceptually, our prototype web querying system has the following components:
Interfaces, that accept the queries, present the results and guide the users to other
functions provided by the system;
A parser, that extracts the information from the HTML files and creates tuples in
the database for each web page in the database;
Query facilities, that invoke the appropriate search processes to provide content
query, structure query and advanced query;
Supplementary functions, that provide facilities for the users to maintain the web
pages stored in the database.
Figure 4.3: The System Architecture
Users interact with this prototype system via an interface, in which they can choose to
pose a query or perform other operations, such as adding a web page to the database, deleting a
web page from the database, or displaying the information of a locally stored HTML file.
When users decide to pose a query, there are three kinds of queries with three different
interfaces provided to the users, namely, content query, structure query and advanced
query. After users specify a query, a corresponding query process is invoked and the query
is performed on the corresponding relations. Finally, the query results are displayed to the
users. Currently, our test web pages are locally stored and parsed by a parser so that
information from a web page can be stored in the three relations introduced in the
previous section.
4.4 Mapping HTML Files to the Database
The key component for mapping an HTML file to the database is the parser. It
provides a means of extracting the information of interest from HTML files and storing
it in a database, e.g., in the three relations mentioned above. In this section we
introduce in detail how the parser maps HTML files to the database.
4.4.1 Extracting Information from HTML Files
In our prototype, the World Wide Web is modeled as a labeled directed graph, which
can be represented by three relations in a relational database. These three relations are:
webpage (url, title, author, linkno, last_modified, size); webpage_d (url, content); links
(url_a, url_b, description). Hence, the information we need to extract from an HTML file
is related to the attributes in these three relations. Assume we are parsing a web page
whose corresponding HTML file is called sample.html. What we would like to extract
from sample.html are the URL of this web page, the title, the author, the number of hyperlinks
contained in this file, the last modified date, the size of the file, all the hyperlinks that can be
found in this file, and their corresponding descriptions. The algorithms used in extracting the
information from an HTML file are described in the next several sections.
4.4.1.1 Extracting Title Information
The title element is common to all HTML files. The HTML DTD (Document Type
Definition) specifies that a <title> container be included in a file and that there should
be only one <title> container in any file [4], although there may be exceptions in some
HTML files. Generally, the title should identify the contents of the document in a global
context, and the title text should be included between <title> and </title> without any other
markup, such as anchors, paragraph tags or highlighting. The syntax introduced above
makes it easy for the parser to extract the title information from an HTML file. The
algorithm is as simple as looking for the string starting with <title> and ending with
</title> and then extracting the string between <title> and </title>.
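The algorithm can be sketched as follows. This simplified version matches the tags case-insensitively, which is an assumption beyond the plain string search described above.

```python
# Sketch of the title-extraction algorithm: return the text between
# <title> and </title>, or an empty string when the element is absent
# or unterminated. Tag matching is case-insensitive.
def extract_title(html):
    lower = html.lower()
    start = lower.find("<title>")
    if start == -1:
        return ""                       # no title element in this file
    start += len("<title>")
    end = lower.find("</title>", start)
    if end == -1:
        return ""                       # unterminated title container
    return html[start:end].strip()
```

Because the DTD allows no other markup inside the title container, no further cleanup of the extracted string is needed.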
4.4.1.2 Extracting Hyperlinks and Description
HTML supports hyperlinks through the anchor tag, in the form
<a href="URL"> anchor object </a>
The anchor object within an <a> container consists of text or another type of object, e.g.,
an image. This anchor object, when defined within a web page, defines a hypertext
relationship to another web page. Both the start and end tags of the <a> container must be
specified. It is the obligation of a browser to display an anchor object in a distinctive
manner so that its role is obvious to a reader. Based on this syntax, extracting a hyperlink
from an HTML file involves the following steps:
£tom an HTML file has the following steps:
step 1: look for the string starting with "Ca ";
Note: Since the definition of a hyperlink starts with ff<a'' and there rnay be some
other markups between "a" and "href ', we can not simply look for "Q href' to
find an hyperlink.
step 2: if "<an is found, keep on looking for the string startuig with "hrefc"";
step 3 : if "hre+"" is found, extract the characters that foliow "bref-"";
step 4: stop when encomtering " "" .
By these four steps, we can obtain the Dest(lRL, which is the URL of another web
page. Now, we should continue extracthg the anchor object that is the description of the
DestURL .
The anchor object may be a string or something else, e.g., an image. For example, if
the anchor object is an image, it can be defined as follows:
<a href="personal.html"><img src="personal.gif"></a>
Here, the anchor object is an image, defined by an <img> element whose src attribute
references the graphic to be displayed as the anchor for the link. Since it provides no text
description but only the file name of the graphic, we simply assign an empty string to the
anchor object in this case.
Since the anchor object appears between <a> and </a> following the definition of a
hyperlink, we can begin to extract the anchor object right after the DestURL has been
fetched. In order to get each DestURL and anchor object one by one, the following steps
should be added after the four steps introduced above:
step 5: continuing from step 4, analyze the string between '<a href="DestURL">' and
"</a>";
step 6: if there are strings starting with "<" and ending with ">", skip them;
Note: Sometimes there is other markup around the anchor object.
step 7: fetch the string that is not contained in any "<" and ">" pair, if available;
otherwise, assign the anchor object an empty string;
Note: Since we are only interested in the textual anchor object, any other anchor
objects that are not character strings will not be extracted.
step 8: stop when the string "</a>" is encountered.
The above 8 steps are used to extract the DestURL and the anchor object from an
HTML file. They just illustrate the general idea; some other things, such as skipping
spaces, should also be considered when the parser is implemented.
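Steps 1 through 8 can be sketched as follows. This is a simplified illustration: among other things, it does not skip stray whitespace and does not bound the href search at the end of the anchor tag.

```python
# Sketch of steps 1-8: extract (DestURL, anchor object) pairs from an
# HTML string. Only textual anchors are kept; an image or other
# non-text anchor object yields an empty string, as described above.
def extract_links(html):
    links, pos = [], 0
    while True:
        a = html.find("<a", pos)                 # step 1: find "<a"
        if a == -1:
            return links
        h = html.find('href="', a)               # step 2: look for href="
        if h == -1:
            return links
        h += len('href="')
        q = html.find('"', h)                    # steps 3-4: up to closing quote
        end = html.find("</a>", q)               # step 8: stop at </a>
        if q == -1 or end == -1:
            return links
        dest_url = html[h:q]
        body = html[html.find(">", q) + 1:end]   # step 5: text between the tags
        text, in_markup = [], False
        for ch in body:                          # steps 6-7: skip nested markup
            if ch == "<":
                in_markup = True
            elif ch == ">":
                in_markup = False
            elif not in_markup:
                text.append(ch)
        links.append((dest_url, "".join(text).strip()))
        pos = end + len("</a>")
```

Each returned pair corresponds directly to a (url_b, description) fragment of a tuple in the links relation.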
Finally, it is worth mentioning that DestURL sometimes appears as a relative URL if
the destination document is on the same Web server as the source document. If we just
simply fetch the relative URL, we cannot obtain useful information. In our prototype, we
convert relative URLs to absolute URLs, since the equivalent absolute URL can always be
constructed from the URL of the current document and the relative URL. The algorithm
used to convert a relative URL to an absolute URL is as follows:
Suppose the syntax of an absolute URL is:
<Protocol>://<Host>/<Path>/<File>
and we know the current URL, which is the URL of the web page in which the
relative URL is defined.
if the relative URL begins with "/"
fetch the string "<Protocol>://<Host>" from the current URL;
append the relative URL to the above string;
else if the current URL ends with "/" or the relative URL begins with "#"
append the relative URL to the current URL;
else if the current URL ends with ".html" or ".htm"
fetch the string "<Protocol>://<Host>/<Path>/" from the current URL;
append the relative URL to this string;
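The three cases above translate directly into code. This is a simplified sketch; the final fallback branch is an addition of ours for inputs that none of the three cases covers.

```python
# Sketch of the relative-to-absolute URL conversion, assuming the
# absolute URL syntax <Protocol>://<Host>/<Path>/<File>.
def resolve(current_url, relative_url):
    proto, rest = current_url.split("://", 1)
    host = rest.split("/", 1)[0]
    if relative_url.startswith("/"):
        # case 1: fetch <Protocol>://<Host> and append the relative URL
        return proto + "://" + host + relative_url
    if current_url.endswith("/") or relative_url.startswith("#"):
        # case 2: append the relative URL to the current URL
        return current_url + relative_url
    if current_url.endswith(".html") or current_url.endswith(".htm"):
        # case 3: fetch <Protocol>://<Host>/<Path>/ and append
        return current_url.rsplit("/", 1)[0] + "/" + relative_url
    return relative_url   # fallback: not covered by the three cases
```

For example, a link "pic.gif" inside http://www.csd.uwo.ca/a/index.html resolves to http://www.csd.uwo.ca/a/pic.gif by the third case.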
4.4.1.3 Extracting Other Information
Other information that should be extracted from the HTML file includes the number
of links, the author, the last modified date and the size of the file. The number of links can be
obtained by using an incremental counter as we parse the HTML file. The name of the
author or the last modified date can be found from the meta-information provided by the meta
element defined in HTML, or elsewhere in the document where such information is specified.
The method used to get this information is similar to the methods introduced above. For
the size of the file, we can simply use a UNIX utility to get it.
4.4.2 Storing the Information in the Database
When the information that describes the web pages has been extracted from HTML files
successfully, it should then be stored in the database relations. As an example, the tuples in
the following three relations are derived from the HTML file shown in Figure 2.3 by the
parser developed for our prototype system.
webpage:
(one tuple; its title column is "xiaoyu yang's home page")
Table 4.1: Relation webpage with 1 Tuple
webpage_d:
(one tuple with columns url and content)
Table 4.2: Relation webpage_d with 1 Tuple
Note: In relation webpage, since the HTML file does not provide information about the author,
the author column is empty.
In relation webpage_d, the column content refers to the whole HTML file.
links:
(13 tuples pairing source and destination URLs on www.csd.uwo.ca and related sites with
link descriptions such as "personal information", "research work", "courses",
"department of computer science", "middlesex college" and "the university of western ontario")
Table 4.3: Relation links with 13 Tuples
4.5 Query Facilities
Our prototype system provides three kinds of query: content query, structure query
and advanced query.
4.5.1 Content Query
A content query is a query that refers to the content of the documents only. In our
data model, content queries are posed on the nodes. Hence, they involve two relations in our
relational model of the Web, i.e., relations webpage and webpage_d. Similar to search
engines, this facility provides users a very simple query interface and allows users to enter
keywords for the query. Figure 4.4 shows the query interface for the content query.
Figure 4.4: The Query Interface for Content Query
When the user enters keywords, the keywords will be used to compose an SQL
statement by the query processor. In this SQL statement, the keywords define the
condition of the search; therefore, they appear only in the where clause. For example, if
the user enters the keyword "Database", the corresponding SQL statement generated by
the query processor is as follows:
select w.url, w.title, w.author, w.link_no, w.last_modified, w.size
from webpage w, webpage_d d
where w.url=d.url and d.content like '%Database%'
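The composition step can be sketched as follows; the function name and the quote-doubling escape are our assumptions about the query processor:

```python
def build_content_query(keyword):
    """Embed the user's keyword in the where clause of the fixed
    content-query SQL statement."""
    safe = keyword.replace("'", "''")  # escape single quotes in the literal
    return (
        "select w.url, w.title, w.author, w.link_no, w.last_modified, w.size "
        "from webpage w, webpage_d d "
        "where w.url=d.url and d.content like '%" + safe + "%'"
    )
```

A production system would pass the keyword through a parameter marker rather than pasting it into the SQL text, but dynamic SQL text is the style the prototype's processing suggests.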
When this SQL statement has been executed, a set of results can be obtained. Each of the
results refers to a web page that contains the keyword specified by the user. The query
processor then restructures the results and displays them to the user. The format of the
result presented to the user is illustrated in Figure 4.5.
1. URL:
Title:
Total Links: - Last modified: - Page Size: - By: -
2. URL:
Title:
Total Links: - Last modified: - Page Size: - By: -
Total 2 match(es)
Figure 4.5: Results of a Content Query
4.5.2 Structure Query
A structure query is a query posed on the underlying hypertext structure of the web
pages. In our data model, it queries the structure of the graph. Therefore, only relation
links is involved in structure queries. The design of the structure query is highly inspired by
the style of QBE (Query by Example) [15], in which users only need to fill in appropriate
places in an empty table to pose the query. The query interface is shown in Figure 4.6.
When a user poses a query, the query processor will dynamically prepare the query and a
corresponding SQL statement will be constructed according to what is specified by the
user in the query. Here, we take advantage of dynamic SQL in DB2 so that the user can
specify his or her queries in a flexible way; that is, a query can be composed of any
combination of the fields appearing in the interface. For example, assume the query is
posed as follows:
Source URL: http://www.csd.uwo.ca/
Destination URL:
Description: graduate program
The corresponding SQL statement generated by the structure query processor is:
select url_a, url_b, description
from links
where url_a='http://www.csd.uwo.ca/' and description like '%graduate program%'
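The dynamic composition can be sketched in Python; the function and the exact shape of the generated SQL are our assumptions about the structure query processor:

```python
def build_structure_query(source_url="", destination_url="", description=""):
    """Build the links-relation query from whichever QBE fields the
    user filled in; empty fields contribute no condition."""
    esc = lambda s: s.replace("'", "''")
    conditions = []
    if source_url:
        conditions.append("url_a='" + esc(source_url) + "'")
    if destination_url:
        conditions.append("url_b='" + esc(destination_url) + "'")
    if description:
        conditions.append("description like '%" + esc(description) + "%'")
    sql = "select url_a, url_b, description from links"
    if conditions:
        sql += " where " + " and ".join(conditions)
    return sql
```

Any combination of the three fields, including none at all, yields a valid statement, which is exactly the flexibility dynamic SQL provides.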
When the SQL statement is executed, a set of results will be collected, formatted and
presented to the user. The format of the results is shown in Figure 4.7.
- Structure Query -
Please define your query:
Source URL: -
Destination URL:
Description:
Figure 4.6: The Query Interface for Structure Query
1. Source URL:
Destination URL:
Description:
2. Source URL:
Destination URL:
Description:
Total 2 match(es)
Figure 4.7: Results of a Structure Query
4.5.3 Advanced Query
The advanced query simulates the CLP (Command Line Processor) provided in
DB2. It can interactively accept SQL statements from a user, execute them, and
display results of various data types. The query can be posed on any of the three relations
that are proposed in our prototype. This kind of query offers a flexible means to pose
queries. However, it also makes the implementation of the query processor more difficult,
since we do not have a clue what the users will include in their queries. For example, we
do not have any idea about the number of columns in the result set and their data types.
This kind of information, however, is essential in executing a query. On the other hand,
DB2 offers a number of descriptors for passing data types and/or values between the
application program and the database, e.g., the SQLDA, which can describe the data types,
lengths, and values of a variable number of data items. These descriptors are more flexible
than a list of host variables used in normal dynamic SQL, because they can be dynamically
configured for different numbers and types of data items at run time. The implementation
of our query processor is based on these descriptors provided by the system. After
a query is executed, the query processor analyzes the contents of these descriptors.
From them, information about the number of columns in a result, their data
types and their values can be obtained. All these kinds of information are what compose the
query results. In the advanced query, a user interface is developed to collect and process
the ad hoc queries posed by the user. This interface is shown in Figure 4.8.
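The run-time discovery that the SQLDA enables in DB2 can be illustrated with Python's DB-API, where cursor.description plays the analogous role (SQLite stands in for DB2 here, and the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table webpage (url text, title text, link_no integer)")
conn.execute("insert into webpage values ('http://www.uwo.ca/', 'uwo', 22)")

# An arbitrary user statement: the column count, names and values are
# unknown until after execution, just as with an SQLDA.
cur = conn.execute("select url, link_no from webpage")
columns = [d[0] for d in cur.description]  # discovered at run time
rows = cur.fetchall()
```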
The results of a query correspond to what is specified in the select clause of the user's
SQL statement. When displaying the results, the UNIX editor pico will be invoked and the
results will be displayed by pico. The reason we exploit an existing editor is that it can
provide convenient editing tools for the query results. In this way, the user can choose to
save the results to a file or select only part of the results to be saved. It is also useful when
the user wants to compare several query results, because the previous results are not lost.
- Advanced Query -
** Queries can be posed on the following three relations: **
webpage (url, title, author, link_no, last_modified, size)
webpage_d (url, content)
links (url_a, url_b, description)
Note: column content can not appear in a select clause
Please enter an SQL statement or 'quit' to Quit:
Figure 4.8: The Query Interface for Advanced Query
1. Column Name 1:
Column Name 2:
Column Name 3:
2. Column Name 1:
Column Name 2:
Column Name 3:
Figure 4.9: Results of an Advanced Query
Figure 4.9 shows the general format of a result set. In this figure, the Column Names
correspond to the names that have been specified in the select clause of the SQL
statement. Please note that only the attribute names in the three relations can appear in the
select clause. The only exception is that the attribute "content" can not appear in the select
clause, because the data type of this attribute is CLOB, which is not suitable to be
displayed in a limited space.
4.6 User Interfaces
Besides the query interfaces introduced in section 4.5, our prototype system also
provides a simple user interface, by which the user can select different functions. Basically,
there are two main user interfaces: one provides access to the three different query
facilities, which is shown in Figure 4.10; the other guides the user to the supplementary
functions used to maintain the web pages stored in the database. This interface is shown in
Figure 4.11.
Querying the Web Main Menu
1. Content Queries Only
2. Structure Queries Only
3. Advanced Queries
4. Others
x. Exit
Figure 4.10: The User Interface for Accessing Query Facilities
4.7 Supplementary Functions
There are several supplementary functions in our prototype system providing a means
to access the relations used in our prototype. Currently, our system provides the following
functions that can be used to:
find hyperlinks from a locally stored HTML file;
create a relation in the database;
add or delete a web page from the database; and
display the tuples in relations webpage or links.
Others
1. Find Hyperlinks from an HTML File
2. Create Relation
3. Add One Web Page to Database
4. Display Relation webpage
5. Display Relation links
6. Delete One Web Page from Database
Figure 4.11: The User Interface for Supplementary Functions
4.8 Summary
This chapter focuses on the implementation of our prototype web querying system.
For modeling the Web, we use a labeled directed graph, which gives us a straightforward
view of the hypertext structure. Based on this data model, it is possible to map the web
pages to relations in a relational database. Once in the database, web pages can be queried
like other data in the database. However, web pages have their own special features, such as
their hypertext structure and irregularity. Therefore, queries on web pages have special
meaning compared to traditional queries. One distinct feature in querying the Web is that
the hypertext structure should also be considered, which leads to the development of the
structure query. To accomplish this, three basic query facilities are provided in our
prototype system, namely, content query, structure query and advanced query. The content
query is used to accept user keywords and then find the web pages that contain the
keywords. The structure query lets us go beyond the keyword search so that the hypertext
structure can be queried as well as keywords. The advanced query is designed for ad hoc
queries. It provides a powerful means to query the web pages stored in the database
flexibly. Besides these functions, there are also some other functions that are used to
maintain the web pages stored in the database. These functions are especially useful when
we develop and test the prototype. During the development of the prototype, we benefited
a lot from exploiting an existing database, i.e., DB2 in our case. And its power will also be
shown in the next chapter when we pose queries on the web pages. As can be shown, if
the Web is modeled by a proper data model and the model can be represented by relations
in a relational database, advanced database technology will provide high-level query
facilities to query the web pages in a flexible and powerful way.
Chapter 5
Querying the Web Pages
As described in the previous chapters, our web querying prototype system provides
three alternatives to pose a query: querying the keywords, querying the hypertext
structure and querying arbitrary content. This chapter starts by describing the distinct
features of these three kinds of queries. Then, the results obtained by running various
types of queries are examined.
5.1 Query Methods
The essential query in web querying is the content query, which is also known as the
keyword query. This kind of query is provided by all search engines. The idea of a content
query is to find the web pages that contain the keywords specified by the user. Obviously,
querying the Web by this means only is not enough, because in a content query, there is no
way to explore the hypertext structure, which, however, is an important feature of web
pages. Therefore, we developed a query facility called the structure query to deal with the
hypertext structure. In addition, our prototype also provides an advanced query facility,
which allows users to pose queries in a flexible way. This advanced query facility is able to
query all the web pages stored in the database. In the following subsections, we illustrate
how these three query facilities work.
5.1.1 Content Query
In our prototype system, the content query uses keywords to query the content of a
web page. The possible keywords can be a single word, a phrase, a sentence or anything
that can appear in an HTML file. Our prototype system even allows the keyword to be
null. In this case, the system assumes the user asks for displaying all the web page
information stored in the relation webpage, e.g., the URL of the web page, the title, the
number of links, etc. Also, wild cards can be embedded in the keywords. The wild card
recognized by DB2 is '%'. It can appear anywhere in the keywords, representing an
arbitrary character string. In our system, the keywords used to search for the information of
interest are partly case sensitive. Users have the alternative to enter their keywords in either
upper or lowercase. When in doubt, the user can use lowercase text in queries. In this
case, the query facility finds both upper and lowercase results. When uppercase text is
used, the query facility finds results only in uppercase. For example, when the user
searches for "london", all occurrences of "london", "London" and "LONDON" can be
found in the results. However, when the user searches for "London", only "London" will
be seen in the results. As can be noticed, entering the keywords exactly as shown in a file
can narrow the scope of the search so that more precise results can be obtained.
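The case rule can be sketched as a small matching helper (our illustration of the observed behaviour; the prototype realizes it inside the generated SQL):

```python
def keyword_matches(keyword, text):
    """An all-lowercase keyword matches regardless of case; a keyword
    containing any uppercase letter must match exactly."""
    if keyword == keyword.lower():
        return keyword in text.lower()
    return keyword in text
```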
5.1.2 Structure Query
The structure query is used to query the underlying hypertext structure embedded in
web pages. Thus, this kind of query explores the web pages by their hyperlinks. In our
prototype system, we provide a QBE-like interface for the structure query. With this
query interface, the user can pose a query by simply filling in the blanks. The interface
provides the following three fields to be filled in: Source URL, Destination URL and
Description. Source URL refers to the URL of a web page in which hyperlinks are
contained. Destination URL is the URL of a hyperlink contained in the web page.
Description is the anchor text describing the hyperlink. There are several alternatives that
users can use to pose a query. If users would like to find out all the hyperlinks in a certain
web page, they just need to provide the source URL. If users provide only the
destination URL, all the web pages that have a link pointing to the
destination URL will be found. Similarly, if users enter both the source URL and the
description, then all the hyperlinks with the specified description in the web page defined
by the source URL will be found, and so on.
5.1.3 Advanced Query
This kind of query is designed for advanced users. There is no restriction on posing
the query. Users can simply write any SQL statements related to the three relations used in
our system, i.e., webpage (url, title, author, link_no, last_modified, size); webpage_d (url,
content); links (url_a, url_b, description). Here, attributes link_no and last_modified are
integers; the data type of attribute content is CLOB. All the other attributes are character
strings. What is worth noticing is that only attributes mentioned in the three relations can
appear in the SQL statement.
The user interface of the advanced query mimics the CLP (Command Line
Processor) of DB2. Therefore, the user can just write the SQL statements and then
execute them. The results of an advanced query totally depend on what is specified in the
select clause of the SQL statement.
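For experimentation, the three relations can be recreated; a sketch in SQLite standing in for DB2, with TEXT approximating CLOB (the column types are our reading of the schema described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table webpage (
        url text primary key, title text, author text,
        link_no integer, last_modified integer, size integer);
    create table webpage_d (url text primary key, content text);
    create table links (url_a text, url_b text, description text);
""")
tables = sorted(r[0] for r in conn.execute(
    "select name from sqlite_master where type='table'"))
```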
5.2 Experimental Results
To test our prototype system and verify the power of using a database query language
to query the Web, we have run several types of queries, and we analyze some of them in
the following examples. In our experiment, more than 70 web pages were chosen to run the
test. All these web pages were reachable from the home page of the Department of
Computer Science, the University of Western Ontario on August 2, 1998. For test
purposes, the HTML files of the corresponding web pages were downloaded to the local
disk and mapped to the database in advance.
5.2.1 Content Query
Content queries are posed on the content of an HTML file. Wild cards and HTML tags
may be used in a content query. To simplify the presentation, in the following examples,
"Input" illustrates the input entered by the user and "Results" shows the query results
presented to the user.
Query 1: Find the web pages that contain the keyword "database".
Input: database
Results:
1. URL: http://www.csd.uwo.ca/research/
Title: uwocsd - department research
Total Links: 4 Page Size 1315
2. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
3. URL: http://www.csd.uwo.ca/faculty/sylvia/pubs.html
Title: sylvia osborn's recent publications
Total Links: 1 Last modified: 01/21/1998 Page Size 2058
4. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
Title: course description
Total Links: 9 Page Size 1519
7. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/research.html
Title: research work
Total Links: 35 Last modified: 02/18/1998 Page Size 4020
8. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
Title: course information
Total Links: 23 Last modified: 02/18/1998 Page Size 6408
9. URL: http://www.csd.uwo.ca/faculty/sylvia.html
Title: sylvia l. osborn
Total Links: 2 Page Size 1184
10. URL: http://www.registrar.uwo.ca/accals/1997/crs_409.htm
Title: course description
Total Links: 8 Page Size 1622
11. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
Title:
Total Links: 1 Page Size 13393
Total 11 match(es)
In the above query, the query results are the web pages containing the keyword
specified by the user, i.e., "database".
Query 2: Find the web pages that contain the keyword "Database" in their text.
Input: Database
Results:
1. URL: http://www.csd.uwo.ca/research/
Title: uwocsd - department research
Total Links: 4 Page Size 1315
2. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
3. URL: http://www.csd.uwo.ca/faculty/sylvia/pubs.html
Title: sylvia osborn's recent publications
Total Links: 1 Last modified: 01/21/1998 Page Size 2058
4. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
Title: course description
Total Links: 9 Page Size 1519
7. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/research.html
Title: research work
Total Links: 35 Last modified: 02/18/1998 Page Size 4020
8. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
Title: course information
Total Links: 23 Last modified: 02/18/1998 Page Size 6408
In the above two queries, the only difference is that the keyword in Query 1, i.e.,
"database", is in lowercase, while the first letter of the keyword in Query 2, i.e.,
"Database", is in uppercase. The results of these two queries are different because the
content query is case sensitive. The results of Query 2 are included in the results of Query 1.
Query 3: Find the web pages that contain the string "computer science 411" in their text.
Input: computer science 411
Results:
1. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
Title: course description
Total Links: 9 Page Size 1413
2. URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Title: computer science
Total Links: 50 Page Size 5920
3. URL: http://www.csd.uwo.ca/grad/courses.html
Title: uwocsd - graduate courses
Total Links: 12 Page Size 9918
This is a query that finds web pages containing a string specified by the user.
Query 4: Find the web pages that contain the following keywords in the order
of "thesis", "submission" and "deadline" in their content.
Input: thesis % submission % deadline
Results:
1. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
Title:
Total Links: 1 Page Size 13393
In this example, a wild card is included in the keyword; therefore the resulting web page
must contain a character string that begins with "thesis", followed by a (possibly empty)
character string and "submission", followed by another (possibly empty) character string,
and finally ends with "deadline". As can be seen in the result, since this web page does not
provide any title information, the Title field in the result is empty.
Query 5: Find the web pages that contain meta-information.
Input: <meta%>
Results:
1. URL: http://www.registrar.uwo.ca/accals/
Title: academic calendars
Total Links: 9 Page Size 3120
2. URL: http://www.uwo.ca/
Title: the university of western ontario
Total Links: 22 Last modified: 1998/02/02 Page Size 4438 By: colleen
Some HTML files have meta-information, which is used to provide general
information about web pages. But this kind of information is not displayed in a web page.
Meta-information is defined by an HTML tag <meta (attribute list)>. Therefore, if we are
interested in the web pages that provide meta-information, we can find these pages by
simply looking for the corresponding HTML files that contain the <meta> tag.
Query 6: Display the information of all the web pages stored in the database.
Input:
Results:
Would you like to display all the records? (y/n) n
In this query, users do not need to enter anything. The system will display by default
all the information of the web pages currently stored in the database. In case there are
too many web pages stored in the database, it is not practical to display all the
information. Hence, if the keyword entered in the content query is null, the system will
prompt the users to confirm whether or not they wish to display all the information.
Query 7: Find the web pages that contain the string "travel agent".
Input: travel agent
Results:
No match has been found.
Since none of our test web pages contains the keywords "travel agent", there is no
result found.
5.2.2 Structure Queries
Structure queries are posed on the hypertext structure of the web pages. In this kind
of query, we exploit a QBE-like query interface for users to pose queries.
Query 8: Find all the hyperlinks contained in the web page whose URL is
"http://www.uwo.ca/".
Input: Source URL: http://www.uwo.ca/
Destination URL:
Description:
Results:
1. Source URL: http://www.uwo.ca/
Destination URL: http://search.uwo.ca:8765/
Description: search uwo
2. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/aboutuwo/~rms.html
Description:
3. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/aboutuwo/mcintosh.html
Description:
4. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/academic/
Description: academic programs faculties and colleges
5. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/adminservices/
Description: administration and services
6. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/alumni/
Description: alumni and friends
7. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/index.html
Description:
8. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/ip/
Description: web publishing at uwo
9. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/libinfo/
Description: libraries and information resources
10. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/research/
Description: research and scholarship
Display more (y/n)? y
11. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/selected_browse.html
Description: internet
12. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description: admissions and student resources
13. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/uwocom/
Description: campus information news and events
14. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/uwoincommunity/
Description: western in the community
15. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/whatsnew/
Description: new
16. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/whois/
Description: find people
17. Source URL: http://www.uwo.ca/
Destination URL:
Description:
Total 17 match(es)
In this query, users only need to specify the Source URL and leave the other two
fields blank. The results are the URLs of the web pages reachable from the web page
identified by the Source URL.
Query 9: Find the web pages that contain a link to the web page whose URL is
"http://www.uwo.ca/".
Input: Source URL:
Destination URL: http://www.uwo.ca/
Description:
Results:
1. Source URL: http://www.csd.uwo.ca/faculty/sylvia/
Destination URL: http://www.uwo.ca/
Description: the university of western ontario
2. Source URL: http://www.uwo.ca/grad/
Destination URL: http://www.uwo.ca/
Description: uwo home page
3. Source URL: http://www.registrar.uwo.ca/accals/
Destination URL: http://www.uwo.ca/
Description:
As shown in the query, when only the field Destination URL is filled, the results will
be the web pages that have links to the web page specified by the Destination URL.
Query 10: Find the web pages that contain hyperlinks about course 407.
Input: Source URL:
Destination URL:
Description: 407
Results:
1. Source URL: http://www.csd.uwo.ca/courses/
Destination URL: http://www.csd.uwo.ca/courses/cs407/
Description: cs407 - advanced software engineering
2. Source URL: http://www.registrar.uwo.ca/accals/1997/sub_16.htm
Destination URL: http://www.registrar.uwo.ca/accals/1997/crs_434.htm
Description: computer science 407a/b
Total 2 match(es)
In this query, only the Description is specified. The Destination URL in each result is the
hyperlink whose anchor text matches the string entered in the Description field. The
Source URL identifies the web page that contains the hyperlink.
Query 11: Retrieve the anchor text of a given hyperlink in a certain web page.
Input: Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description:
Results:
1. Source URL: http://www.uwo.ca/
Destination URL: http://www.uwo.ca/sguide/
Description: admissions and student resources
This kind of query is used in case users know the URL of a web page and one of
its hyperlinks, and they are interested in what the hyperlink is about. It will extract the
detailed description of the specified hyperlink for users.
Query 12: Find the hyperlinks about "submission" and "thesis" in a certain web page.
Input: Source URL: http://www.uwo.ca/grad/thesis/
Destination URL:
Description: submission % thesis
Results:
1. Source URL: http://www.uwo.ca/grad/thesis/
Destination URL: http://www.uwo.ca/grad/thesis/chap1a.html#final
Description: 1.5 final submission of the examined and corrected thesis
2. Source URL: http://www.uwo.ca/grad/thesis/
Destination URL: http://www.uwo.ca/grad/thesis/chap1a.html#submission
Description: 1.2 submission of the thesis for examination
Total 2 match(es)
As can be seen in these results, two hyperlinks contained in the web page specified by
the user are found. The keywords "submission" and "thesis" appear in both of the
Description fields.
Query 13: Find the web pages that contain a link to the home page of the Department of
Computer Science, the University of Western Ontario, and the anchor is about
"department of computer science".
Input: Source URL:
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
Results:
1. Source URL: http://www.csd.uwo.ca/faculty/sylvia/
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
2. Source URL: http://www.csd.uwo.ca/gradstudents/students/xyang/
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
3. Source URL: http://www.csd.uwo.ca/gradstudents/students/xyang/personal.html
Destination URL: http://www.csd.uwo.ca/
Description: department of computer science
Total 3 match(es)
When users know a hyperlink and the corresponding anchor text, they can use this
query to find out which web pages contain the hyperlink.
Query 14: Find whether there is a web page whose URL is specified as the Source URL
and that contains a link to another given web page.
Input: Source URL: http://www.uwo.ca/
Destination URL: http://www.csd.uwo.ca
Description: computer science
Results:
No match has been found.
Since the home page of the University of Western Ontario does not have a link to the
Department of Computer Science's home page, the result of this query is empty.
Query 15: Enter nothing for the query.
Input: Source URL:
Destination URL:
Description:
Result:
Would you like to display all the records? (y/n) n
If users enter nothing for the query, the system assumes users would like to examine
all the tuples about the hyperlinks. Since there may be too many tuples to be displayed, the
system prompts users to confirm whether or not they want to display the results.
5.2.3 Advanced Query
Advanced queries can be posed on the content, the hypertext structure, or the
combination of the content and the hypertext structure of web pages. This kind of query is
written as an SQL statement, and hence provides a flexible means for users to pose the
query. Meanwhile, users can take full advantage of advanced database query facilities when
writing the query as an SQL statement.
Query 16: Find the web pages about "database".
Input: select w.url, w.title, w.link_no, w.size \
from webpage w, webpage_d d \
where w.url=d.url and d.content like '%database%'
Results:
1. URL: http://www.csd.uwo.ca/grad/courses.html
TITLE: uwocsd - graduate courses
LINK_NO: 12
SIZE: 9918
2. URL: http://www.csd.uwo.ca/faculty/sylvia.html
TITLE: sylvia l. osborn
LINK_NO: 2
SIZE: 1184
3. URL: http://www.uwo.ca/grad/thesis/chap1a.html#overview
TITLE:
LINK_NO: 1
SIZE: 13393
4. URL: http://www.registrar.uwo.ca/accals/1997/crs_435.htm
TITLE: course description
LINK_NO: 9
SIZE: 1413
5. URL: http://www.registrar.uwo.ca/accals/1997/crs_420.htm
TITLE: course description
LINK_NO: 9
SIZE: 1519
6. URL: http://www.registrar.uwo.ca/accals/1997/crs_409.htm
TITLE: course description
LINK_NO: 8
SIZE: 1622
As shown in the above example, this query can be used as a content query. It finds the
web pages which have "database" in their text. The difference between this query and the
content query is that the former looks for the web pages that contain the exact keyword
specified in the where clause, while the latter searches for the web pages that contain the
keyword in both lower and uppercase.
Query 17: Find web pages whose titles contain "home page".
Input: select url, title \
from webpage \
where title like '%home page%'
Results:
1. URL: http://www.csd.uwo.ca/faculty/sylvia/
TITLE: sylvia osborn's home page
2. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/
TITLE: xiaoyu yang's home page
This query can be used to find out a group of web pages with a particular topic.
Query 18: Find web pages which have no outgoing links.
Input: select url, title \
from webpage \
where link_no=0
Results:
This query can find all the leaf nodes in the graph model used in our system.
Query 19: Retrieve the title and the URL of all the web pages that are pointed to from the
web page whose URL is http://www.csd.uwo.ca/gradstudents/students/xyang/.
Input: select w.url, w.title \
from webpage w, links l \
where l.url_a='http://www.csd.uwo.ca/gradstudents/students/xyang/' \
and w.url=l.url_b
Results:
1. URL: http://www.csd.uwo.ca/faculty/sylvia/
TITLE: sylvia osborn's home page
2. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/courses.html
TITLE: course information
3. URL: http://www.csd.uwo.ca/gradstudents/students/xyang/interests.html
TITLE: interesting links
4. URL:
TITLE:
5. URL:
TITLE:
6. URL:
TITLE:
This query is used to find all the hyperlinks in a web page specified by the user.
Query 20: Find the URLs of web pages mentioning "WebSQL" from the web page:
http://www.csd.uwo.ca/gradstudents/students/xyang/.
Input: select w.url \
from webpage w, webpage_d d, links l \
where d.content like '%WebSQL%' \
and l.url_a='http://www.csd.uwo.ca/gradstudents/students/xyang/' \
and d.url=l.url_b and w.url=l.url_b
Results:
From a given web page, this query tries to find all the web pages that contain specific
information of interest and are pointed to from the given web page.
Query 2 1: Find web pages contain anchors mentioning "publications" in their t ea .
Input: select w-url, w-title \
nom webpage w, Links 1 \
where 1-description Iike '%publications%' and 1-url-a=w.url
1. URL: hûp ~/www.csd.uwo.ca/fàcuhy/syLvia/
TmE: sylvia osborn's home page
This query can find all the web pages that contain hyperlinks mentioning a particular
topic.
Query 22: Find web pages with a certain topic that are reachable from a given web page
within three levels.
Input: with \
temp(url_b, description, stops) as \
((select url_b, description, cast(0 as smallint) \
from links \
where description='computer science 411a/b') \
union all \
(select l.url_b, l.description, cast(t.stops+1 as smallint) \
from links l, temp t \
where t.url_b=l.url_a \
and l.description like '%computer science%' \
and t.stops<2)) \
select p.url_b, p.description \
from temp p
Results:
1. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-435.htm
DESCRIPTION: computer science 411a/b
2. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-420.htm
DESCRIPTION: computer science 319a/b
3. URL_B: http://www.registrar.uwo.ca/acadcal/1997/crs-414.htm
DESCRIPTION: computer science 202
This example illustrates a recursive query. The recursive query is an important feature in
the new generation of relational databases. It is very powerful, because it allows certain
kinds of questions to be expressed in a single SQL statement that would otherwise require
the use of a host program. As can be seen in the example, the use of a recursive query lets
us go along the hypertext hierarchy to find the information of interest, which is impossible
by using any of the search engines currently available.
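The traversal performed by Query 22 can be sketched in a self-contained way using SQLite's WITH RECURSIVE, which plays the same role as the recursive common table expression in the DB2 version above. This is only an illustrative sketch: the links rows below are made-up sample data, not the pages used in our experiment.

```python
import sqlite3

# Toy copy of the links(url_a, url_b, description) relation from our model.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (url_a TEXT, url_b TEXT, description TEXT)")
con.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    ("page0", "page1", "computer science 411a/b"),
    ("page1", "page2", "computer science 319a/b"),
    ("page2", "page3", "computer science 202"),
    ("page3", "page4", "computer science 210"),  # more than three levels away
])

# Same shape as Query 22: anchor rows at stops=0, then follow edges while
# the anchor text still mentions the topic and we are within three levels.
rows = con.execute("""
    WITH RECURSIVE temp(url_b, description, stops) AS (
        SELECT url_b, description, 0
        FROM links
        WHERE description = 'computer science 411a/b'
        UNION ALL
        SELECT l.url_b, l.description, t.stops + 1
        FROM links l, temp t
        WHERE t.url_b = l.url_a
          AND l.description LIKE '%computer science%'
          AND t.stops < 2
    )
    SELECT url_b, description FROM temp
""").fetchall()
for url, desc in rows:
    print(url, desc)
```

Note how `t.stops < 2` bounds the recursion to three levels (stops 0, 1 and 2), so "page4" is never reached.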
5.3 Summary
This chapter focuses on the query methods implemented in our prototype system and
the experimental results obtained by using these query methods. Currently, there are three
kinds of query facilities available in our prototype system, namely, the content query, the
structure query and the advanced query. The content query searches the keywords
specified by the user. In this kind of query, wild cards or HTML tags are allowed to be
included in the keywords. From the experimental results, it is easy to find that any word,
phrase, tag or sentence can be used as a keyword as long as it appears in an HTML file. As
can be seen, it is an essential yet powerful method to find web pages of interest. However,
there exists a significant limitation in the content query, that is, it cannot be posed on the
hypertext structure presented in an HTML file. To remedy this defect, we developed a
new query facility which is called the structure query. It allows users to pose queries on the
URL of a web page, the hyperlinks of the web page and their corresponding anchor text.
This kind of query makes it possible to explore the structure of web pages and thus
enhances the query capability of our system. In addition, our prototype system offers an
advanced query facility by which queries can be posed in the form of SQL statements. It
allows complex queries to be posed to search both the content and the structure of
web pages. This query facility takes full advantage of the database query language and
therefore provides a very powerful means of querying the Web.

More than 70 web pages have been used to test our prototype system. The
experimental results indicate that users can thoroughly search the web pages and obtain
the information of interest by using the query facilities provided by our prototype
system. The experimental results also strongly show that the database query language is a
very powerful tool for querying the Web.
Chapter 6
Discussions and Future Work
6.1 Discussions
In this thesis, we introduced our prototype system for querying the Web with
database query languages. Currently, there are two ways to retrieve information from the
Web, namely, navigation/browsing and searching by search engines. However, there exist
some significant limitations, such as the "lost-in-hyperspace" phenomenon, no query
facilities for the hypertext structure, etc. These drawbacks motivated the development of our
web query system. The design of our prototype system is based on the idea that the
combination of database techniques and the Web modeling technique should provide a
much more powerful and flexible way of querying the Web.
To query the Web, the first problem encountered is web modeling. In our
prototype system, we model the Web as a labeled directed graph, with web pages as nodes
and labels on the edges. Based on this graph data model, the Web can be viewed as a
relational database, which is composed of three relations: webpage, webpage_d and links.
This relational abstraction of the Web allows us to store web page information in a
database and use a database query language to retrieve the information of interest.
To extract the information of a web page from the source HTML file and store it in
the database, a parser was designed and implemented in our prototype system. A user
interface was also developed to accept the query and guide users to all the available
functions of our system.
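The kind of extraction the parser performs can be sketched as below. The prototype's parser is written in C; this Python class, its name, and the sample page are all invented for illustration.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Sketch: pull the title and the (href, anchor text) pairs out of a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []            # (href, anchor_text) pairs for the links table
        self._in_title = False
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)

p = PageParser()
p.feed("<html><head><title>xiaoyu yang's home page</title></head>"
       '<body><a href="courses.htm">course information</a></body></html>')
print(p.title, p.links)
```

The extracted title would populate the webpage relation, and each (href, anchor text) pair would become a row of links.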
As for the query facilities, our prototype system was implemented to support three
kinds of query, namely, the content query, the structure query and the advanced query.
The content query provides the keyword querying function that most search engines
provide. Its limitation is that it cannot query the hypertext structure of the web page. To
address this problem, our prototype system provides a particular query facility called the
structure query. It allows users to search the web pages by their underlying hypertext
structure. Furthermore, our system also offers an advanced query facility, by which users
can pose arbitrary queries in the form of complex SQL statements. This query method
benefits a lot from the advanced query capabilities of the database query language and
allows queries to be posed on both the content and the hypertext structure of web pages.
The prototype system has been extensively tested with more than 70 web pages. The
testing results show that the database query language has been successfully used to query
web pages and the whole system performs very well.
As a newly developed prototype, our web querying system has some limitations. It is
suitable for searching web information in a limited scope, but not scalable to the whole
Web at this moment. Also, at the current stage, our system cannot automatically react to
changes of web pages. Moreover, the content query actually duplicates what search
engines already do. Thus, future work should address overcoming these limitations
and turning this prototype into a complete web querying system.
6.2 Future Work
There is a lot of work that could be done in the area of querying the Web. The work
is related to several areas in computer science, including databases, networks, user
interfaces, and algorithm design. Currently, our prototype system emphasizes only
the issues of databases and provides a framework for future development of the web
querying system.
The first and foremost improvement that could be made to our prototype system is to
give the system the capability of interacting with the Internet dynamically. To date,
our prototype system can only deal with web pages that are stored on the local disk.
These web pages are downloaded from the Web in advance. In order to query the Web
dynamically, the prototype system should be extended in several aspects. Firstly, the
current programs written in C should be translated into Java to take advantage of its
strong capability of communicating with the Internet. Secondly, the content query facility
should be enhanced. The keyword search can be divided into two steps. In the first step,
several search engines are used to query web pages and the results from different search
engines are combined to provide more extensive information. Similar work can be found
in [16]. In the second step, the results from the first step will be further searched by our
content query facility to eliminate the irrelevant results obtained from the search engines.
Intuitively, these two steps could make the search much better and increase the precision
of the results. Thirdly, the problem of automatically updating the web pages stored in the
database should be solved. Since one of the features of the Web is that it changes
frequently, making the web pages in the database adapt to the changes is very important
for obtaining the latest web information.
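The proposed two-step keyword search could be sketched as follows. Everything here is hypothetical: the engine result lists, the fetched page texts, and both helper functions are invented to illustrate the merge-then-filter idea, not part of the prototype.

```python
def merge_results(*result_lists):
    """Step 1 (sketch): union the URL lists from several search engines,
    preserving first-seen order and dropping duplicates."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

def content_filter(urls, pages, keyword):
    """Step 2 (sketch): keep only URLs whose fetched page text actually
    contains the keyword, eliminating irrelevant engine hits."""
    return [u for u in urls if keyword in pages.get(u, "")]

# Invented sample data standing in for two engines' result lists.
engine_a = ["http://a.example/1", "http://a.example/2"]
engine_b = ["http://a.example/2", "http://b.example/3"]
pages = {
    "http://a.example/1": "a page about WebSQL and web query languages",
    "http://a.example/2": "an unrelated page",
    "http://b.example/3": "WebSQL examples",
}

merged = merge_results(engine_a, engine_b)
relevant = content_filter(merged, pages, "WebSQL")
print(relevant)
```

The merge step broadens recall by combining engines; the content filter then raises precision by rechecking each candidate page against the content query.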
The second improvement that could be made is related to the recursive query. In our
experiment, we tried to pose recursive queries on a hypertext hierarchy without cycles and it
worked fine. But we had some trouble using recursive queries on a hypertext structure
with a cycle. The reasons are as follows: (1) the version of DB2 we used cannot fully
support transitive closure when there are cycles; and (2) it is difficult to design the
stopping rules for the recursive queries because sometimes the information provided by
web pages is incomplete. Since recursive queries are powerful and very important, how to
use them properly in querying arbitrary web pages should be well studied in the future.
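One possible stopping rule for cyclic hypertext, sketched outside the database: track the set of already-visited URLs so each page is expanded at most once. The function and the three-page cyclic graph below are invented for illustration; they are not how the prototype handles cycles.

```python
def reachable(links, start, max_stops):
    """Breadth-first walk over (url_a -> url_b) edges, bounded by max_stops.
    The visited set makes the walk terminate even when the graph has cycles."""
    visited = {start}
    frontier = [start]
    for _ in range(max_stops):
        nxt = []
        for a, b in links:
            if a in frontier and b not in visited:
                visited.add(b)
                nxt.append(b)
        frontier = nxt
    return visited

# Invented graph containing a cycle: p1 -> p2 -> p3 -> p1.
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p1")]
print(sorted(reachable(links, "p1", 3)))
```

A pure recursive SQL query without such a visited check would revisit p1 forever on this graph, which is why both a depth bound and duplicate elimination matter as stopping rules.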
Another extension to our prototype system is the development of a user-friendly
interface for specifying queries. This interface could be developed in the spirit of Query By
Example.
References

[1] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. A Declarative Language
for Querying and Restructuring the Web. In Proc. of the 6th International Workshop
on Research Issues in Data Engineering, RIDE'96, IEEE, New Orleans, Feb. 1996.
[2] G. Arocena, A. Mendelzon, G. Mihaila. Applications of a Web Query Language.
In Proc. of the Sixth International World Wide Web Conference, 1997.
[3] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. In Proc.
of the First International Conference on Parallel and Distributed Information Systems
(PDIS'96), December 1996.
[4] Bebo White. HTML and the Art of Authoring for the World Wide Web. Kluwer
Academic Publishers, 1996.
[5] Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and
Arthur Secret. The World-Wide Web. Communications of the ACM, Vol. 37,
No. 8, Aug. 1994.
[6] Martijn Koster. Robots in the Web: Threat or Treat?
http://web.nexor.co.uk/users/mak/doc/robots/threat-or-treat.html
[7] S. Abiteboul, S. Cluet, and T. Milo. Querying and Updating the File. In Proc. of
the Conf. on Very Large Databases (VLDB), Morgan Kaufmann, 1993.
[8] V. Christophides, S. Cluet, S. Abiteboul and M. Scholl. From Structured
Documents to Novel Query Facilities. In ACM SIGMOD International Conference
on Management of Data, 1994.
[9] S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener. The Lorel Query
Language for Semi-Structured Data. 1996. http://www-db.stanford.edu.
[10] S. Abiteboul. Querying Semi-Structured Data. In Proc. of the International
Conference on Database Theory, Delphi, Greece, Springer, 1997.
[11] P. Atzeni, G. Mecca and P. Merialdo. Semistructured and Structured Data in the
Web: Going Back and Forth. ACM SIGMOD Record, Vol. 26, No. 4, December
1997.
[12] N. Ashish and C. Knoblock. Semi-Automatic Wrapper Generation for Internet
Information Sources. http://www.isi.edu/sims/{naveen,knoblock}
[13] D. Konopnicki and O. Shmueli. W3QS: A Query System for the World Wide Web.
In Proc. of the 21st VLDB Conference, Morgan Kaufmann, 1995.
[14] C. Jenkins, M. Jackson, P. Burden and J. Wallis. Searching the World Wide Web:
An Evaluation of Available Tools and Methodologies. Information and
Software Technology, Vol. 39, pp. 985-994, Butterworth Scientific, 1998.
[15] Serge Abiteboul, Richard Hull and Victor Vianu. Foundations of Databases.
pp. 150-156. Addison-Wesley Publishing Company, 1995.
[16] U. Manber and P. A. Bigot. Connecting Diverse Web Search Facilities. Data
Engineering Bulletin, IEEE, Vol. 21, No. 2, June 1998.
[17] A. O. Mendelzon and T. Milo. Formal Models of Web Queries. In Proc. of the
16th ACM Symposium on Principles of Database Systems, Tucson, Arizona,
May 1997.
[18] P. Buneman. Semistructured Data: A Tutorial. In Proc. of PODS, pages 117-121,
Tucson, Arizona, May 1997.
[19] S. Nestorov, S. Abiteboul and R. Motwani. Inferring Structure in Semistructured
Data. ACM SIGMOD Record, Vol. 26, No. 4, December 1997.
[20] Y. Papakonstantinou, H. Garcia-Molina and J. Widom. Object Exchange Across
Heterogeneous Information Sources. In IEEE International Conference on Data
Engineering, March 1995.
[21] P. Buneman, S. Davidson, M. Fernandez and D. Suciu. Adding Structure to
Unstructured Data. In Proc. of ICDT, Springer, January 1997.
[22] D. Quass, A. Rajaraman, Y. Sagiv and J. Ullman. Querying Semistructured
Heterogeneous Information. In Proc. of the 4th International Conference on
Deductive and Object-Oriented Databases, Springer, December 1995.
[23] V. Bush. As We May Think. Atlantic Monthly, Vol. 176, No. 1, 1945.
[24] D. Chamberlin. Using the New DB2. pp. 1-35. Morgan Kaufmann Publishers, Inc.,
1996.