
29 June 2005 EECS Department University of Kansas

Improving Query Retrieval Times in the Temporal Search Engine

By Ryan Sheahan

Committee Chair: Dr. Susan Gauch
Committee Member: Dr. Perry Alexander
Committee Member: Dr. Nancy Kinnersley


Outline

Motivation and Goals
Related Work
System Details
Experiments and Results
Conclusions
Future Work


Motivation

Conventional search engines do not store old versions of websites.

By keeping a version history we can:
  Save the content of a page
  Answer questions about changes over time
  Track the evolution of web pages

The Temporal Search Engine accomplishes these tasks, but needs improvement.


Goals

Implement the Temporal Search Engine, correcting the logic error.

Modify the indexing to support temporal indexing.

Show the benefits during the retrieval phase of the modified project.


Related Work

Temporal Knowledge
Time Transaction Databases
Source Code Control Systems
Versioning Online Documents


Related Work

Defining temporal knowledge:
  Time points
  Time intervals

Time-Transaction Databases
  Valid Time
  Transaction Time

Source Code Control Systems
  SCCS
  RCS


Related Work

Versioning online documents
  When to create new versions of documents?
  Edit-based or copy-based tracking?

Version control for online documents
  Temporal stamps within documents
  Temporal tracking by servers


System Details

System Overview
Spider Functionality
Database
Indexing
Retrieval
Improvements
Screenshots


System Overview

A search engine has 3 primary parts:
  The spider collects web pages.
  The indexer collates the information in the web pages into a searchable file.
  The retrieval component provides a user interface for searching the index file.

The Temporal Search Engine also utilizes a database to track versions.


System Overview

Figure 1: System overview diagram. Components: Spider, Collected pages, Temporal Indexer, Indexed Files, Database, Query Engine, Web Browser. Labeled flows: Query & Range, Filenames, File Record, Results.


Spider Functionality

The spider is run daily using WGET.

When new pages are found, they are added to the database and stored.

Previously collected pages are then compared to the stored version using diff:
  Changed pages are added to the database and stored for indexing.
  Unchanged pages are discarded.
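The keep-or-discard decision can be sketched as follows. This is a minimal illustration, not the project's code: a byte-for-byte comparison with Python's filecmp stands in for diff, and the function name, arguments, and directory layout are assumptions.

```python
import filecmp
import shutil
from pathlib import Path
from typing import Optional


def archive_if_changed(fetched: Path, stored: Optional[Path], dated_dir: Path) -> bool:
    """Keep `fetched` only if it is new or differs from the stored version.

    Returns True when the page was copied into `dated_dir`.
    """
    # shallow=False forces a comparison of file contents, not just os.stat data.
    if stored is not None and filecmp.cmp(fetched, stored, shallow=False):
        return False  # unchanged page: discard it
    dated_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(fetched, dated_dir / fetched.name)  # new or changed page: keep it
    return True
```

A page with no stored version is passed with stored=None and is always kept, which mirrors the "new pages" case on this slide.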


Database - MySQL

The database is used to keep a record of the collected pages.

There are 3 fields for each record.

Description                     Field           Datatype   Example
Uniform Resource Locator        URL             String     http://www.pbs.org/index.html
Date when this file was added   date_spidered   String     20050322
File name used in indexing      Filename        String     91.html

Table 1


File System

The collected pages are stored in a publicly accessible directory.

This directory contains sub-directories named by year, month, and day (e.g., 20050322).

Each version is stored in a dated directory, based on its collection date.
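As a small illustration of this layout, the sketch below builds the storage location of one version from the date_spidered and Filename values of a database record. The root directory name "archive" and the function name are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def stored_path(root: PurePosixPath, date_spidered: str, filename: str) -> PurePosixPath:
    """Return <root>/<YYYYMMDD>/<filename> for one archived version."""
    if len(date_spidered) != 8 or not date_spidered.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {date_spidered!r}")
    datetime.strptime(date_spidered, "%Y%m%d")  # raises ValueError for an impossible date
    return root / date_spidered / filename


# Example record values taken from Table 1; "archive" is a made-up root.
print(stored_path(PurePosixPath("archive"), "20050322", "91.html"))
```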


Indexing

An index is an easily searchable file of the information in the archived web pages.

Pages are pre-processed to remove unnecessary information.

A list of the keywords in each document is generated and stored.

A list of the documents containing each keyword is stored in a separate file.


The Index

A Dictionary record has three parts:
  word
  number of documents the word occurs in
  offset in the Postings file

A Postings record has two parts:
  file name
  weight of the word in that file
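The two record types can be sketched in memory as below. This is an illustrative assumption about how the parts fit together, not the project's file format: offsets are zero-based list positions, the class and function names are invented, and the weights are supplied as inputs because the slides do not say how they are computed.

```python
from typing import NamedTuple


class DictionaryRecord(NamedTuple):
    word: str
    num_docs: int   # number of documents the word occurs in
    offset: int     # position of the word's first Postings record


class PostingsRecord(NamedTuple):
    filename: str
    weight: float   # weight of the word in that file


def build_index(doc_weights):
    """doc_weights maps filename -> {word: weight}."""
    by_word = {}
    for filename, weights in doc_weights.items():
        for word, weight in weights.items():
            by_word.setdefault(word, []).append(PostingsRecord(filename, weight))
    dictionary, postings = {}, []
    for word in sorted(by_word):
        records = by_word[word]
        # The word's Postings records are stored contiguously, starting at offset.
        dictionary[word] = DictionaryRecord(word, len(records), len(postings))
        postings.extend(records)
    return dictionary, postings


def lookup(dictionary, postings, word):
    """One hash lookup in the Dictionary, then one slice of the Postings list."""
    record = dictionary.get(word)
    if record is None:
        return []
    return postings[record.offset:record.offset + record.num_docs]
```

Keeping each word's Postings records contiguous is what lets a Dictionary record describe them with only a count and an offset.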


The Index

The pilot Temporal Search Engine created a separate index for each day that was archived.

Dictionary File
Word       # of Docs   Offset
Temporal   3           2
• • •

Postings File
Filename   Weight
54.html    0.0082
23.html    0.0043
119.html   0.0003
• • •

(The figure numbers the Postings records 1 through 7.)

Figure 2


Index Directory Structure

Indexed_Pages
  20050322
    1.html
    2.html
    3.html
    • •
    Dictionary.txt
    Postings.txt
  20050323
    1.html
    2.html
    3.html
    • •
  • • •
  2005XXXX
    1.html
    2.html
    3.html
    • •

Figure 3

Since the original system only searches files in the user-specified range, results can be missed.


Retrieval

A user’s query is quickly looked up in a Dictionary file since it is a hash table.

The Postings file shows us the associated documents for the user’s query for a specific day.

To return a page to a user, we find which day it was archived and display the appropriate page.


Retrieval Error

Each day's index includes only the pages that were modified that day, so older unchanged pages do not appear in it.

Pages that do not change within the user-specified range will not be shown.


Retrieval Error

Daily indexes (Dict entry: Post entries):
  2005 03 22 Index   cat: 72.html 34.html 10.html 19.html
  2005 03 23 Index   cat: 72.html 10.html
  2005 03 24 Index   cat: 72.html 14.html

Query: cat
Start Date: 2005 03 23
End Date: 2005 03 24

Only 2005 03 23 and 2005 03 24 would be accessed. Pages 34.html and 19.html would not be returned, even though they should be.

Figure 4


Fixing Retrieval

Although the user may not notice this error, it is a fairly serious flaw in the system design.

We must loop over the entire archive, from the beginning up to the user-entered end date.

This is the base system against which we will compare our improvements.
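The difference between the original and the corrected multiple-index search can be sketched with the data of Figure 4. This is a simplified illustration: which of the two shorter postings lists belongs to 23 March and which to 24 March is an assumption, the function names are invented, and the database check for the most recent version is not modeled.

```python
# Daily indexes from Figure 4, keyed by archive day (YYYYMMDD).
DAILY_INDEXES = {
    "20050322": {"cat": ["72.html", "34.html", "10.html", "19.html"]},
    "20050323": {"cat": ["72.html", "10.html"]},  # assumed day for this list
    "20050324": {"cat": ["72.html", "14.html"]},  # assumed day for this list
}


def search_in_range_only(indexes, word, start, end):
    """Original behaviour: consult only the daily indexes inside the range."""
    hits = set()
    for day in sorted(indexes):
        if start <= day <= end:
            hits.update(indexes[day].get(word, []))
    return hits


def search_from_archive_start(indexes, word, end):
    """Corrected behaviour: loop from the first archived day up to the end date.

    Returns filename -> most recent day (on or before `end`) whose index
    lists the file for `word`.
    """
    latest = {}
    for day in sorted(indexes):
        if day > end:
            break
        for filename in indexes[day].get(word, []):
            latest[filename] = day
    return latest


missed = set(search_from_archive_start(DAILY_INDEXES, "cat", "20050324")) \
    - search_in_range_only(DAILY_INDEXES, "cat", "20050323", "20050324")
print(sorted(missed))
```

The corrected loop touches one Dictionary file per archived day, which is the cost the single temporal index is meant to remove.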


Additional Features

Users can review all versions of a document.

They can view changes between two documents.

Users can sort results by date or relevance.


Improvements

Create a single, temporal index that contains all files.

Together, a directory name and a filename create a unique identifier for each file.

The temporal index simplifies the retrieval process, since we do not need to loop over several dictionary files.


Temporal Index Retrieval

A single lookup in the Dictionary file is needed.

Then parse the records from the Postings file to get the archival date and the filename.

Using the date, we keep only the files that fall in the user-specified range.

Filename             Weight
20050322_54.html     0.0042
20050329_54.html     0.0033
20050404_119.html    0.0029
20050327_15.html     0.0012
• • •

Figure 5
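The parse-and-filter step can be sketched with the Postings records of Figure 5. This is a minimal sketch with invented function names; it applies the date filter as described on this slide and does not model how the real system treats versions archived before the start date that are still current within the range.

```python
from datetime import date, datetime

# Postings records for one query word, as shown in Figure 5.
POSTINGS = [
    ("20050322_54.html", 0.0042),
    ("20050329_54.html", 0.0033),
    ("20050404_119.html", 0.0029),
    ("20050327_15.html", 0.0012),
]


def parse_posting(identifier):
    """Split 'YYYYMMDD_name.html' into (archival date, filename)."""
    day_text, filename = identifier.split("_", 1)
    return datetime.strptime(day_text, "%Y%m%d").date(), filename


def filter_by_range(postings, start, end):
    """Keep the records whose archival date lies in [start, end]."""
    hits = []
    for identifier, weight in postings:
        day, filename = parse_posting(identifier)
        if start <= day <= end:
            hits.append((day, filename, weight))
    return hits


for day, filename, weight in filter_by_range(POSTINGS, date(2005, 3, 27), date(2005, 4, 4)):
    print(day.isoformat(), filename, weight)
```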


Query Screen

Figure 6


Results Screen

Figure 7


All Versions Screen

Figure 8


File Comparison Screen

Figure 9


Experiments and Results

Data Set
Test Cases
Retrieval Improvements
Indexing Costs


Data Set

Test data was gathered from the following URLs.

The websites were tracked for 14 days.

1. www.ittc.ku.edu 7. www.research.ku.edu

2. www.kuhistory.com 8. www.engr.ku.edu

3. www.career.engr.ku.edu 9. www.kslegislature.org

4. www.jocoelection.org 10. www.cartoonnetwork.com

5. www.eecs.ku.edu 11. www.fidelity.com

6. jobs.ku.edu 12. www.pbs.org

Table 2


Pages Collected Per Day

Day/Site     1    2    3    4    5    6    7    8    9   10   11   12
03 22       22   74    1  428   30    1  472    0    0    0    0    0
03 23        1    0    1    1    1    0    2    0    0    0    0    0
03 24        1    2    1   25    2    0    7   52   40   52   66  256
03 25        1    0    1    3    0    0    3    0   16    6    3  100
03 26        1    1    1    1    0    0    5    0    1    3   23  180
03 27        1    0    1    1    0    0    0    0    1    3   23  137
03 28        1    0    1    1    0    0    0    0    0    1   23  168
03 29        1    0    1    5    0    0    3    0    2    7   26  108
03 30        1    0    1    3    0    0    4    0    1    1   23  174
03 31        1    2    1    6    0    0    5    0    1    1   24  139
04 01        1    1    1    1    0    0    8    0    1    1   23  154
04 02        2    0    1    1    0    0    5    0    0    1   24  156
04 03        1    0    1    1    0    0    1    0    1    1   23  122
04 04        1   20    1    1    0    0    0    0    1    1   23  145
Total       36  100   14  478   33    1  515   52   65   78  304 1839

Table 3


Test Cases

Twelve queries were run over date ranges of varying length.

Queries contained between one and four words.

One Word    Two Word                  Three Word                Four Word
computer    current news              buy car cheap             usa election voter turnout
longevity   philosophical arguments   lowest market rate        curing cancer technology advancement
test        pigeon hole               career intern positions   harmful effects television children

Table 4


Test Cases

Each query was tested over a range, starting at just the first day in the archive and expanding to include all 14 days.

At its peak, the average retrieval time for the multiple-index system was 12.71 seconds.

The highest average retrieval time for the temporal index system was 7.51 seconds.
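To put the two peak averages side by side, the short calculation below derives the absolute and relative difference. It assumes both peaks were measured over the same range of days, which the slide does not state explicitly.

```python
multi_index_peak = 12.71     # seconds, multiple-index system
temporal_index_peak = 7.51   # seconds, temporal index system

saved = multi_index_peak - temporal_index_peak
reduction = saved / multi_index_peak              # fraction of time removed
speedup = multi_index_peak / temporal_index_peak  # ratio of the two times

print(f"{saved:.2f} s saved, {reduction:.1%} lower, {speedup:.2f}x faster")
# prints: 5.20 s saved, 40.9% lower, 1.69x faster
```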


Average Retrieval Time

Figure 10: Chart of average retrieval time in seconds (axis 0.00 to 14.00) against the number of days in the range (1 to 14). Series: Temporal Index, Multi-Index.


Complexity of Query

The complexity of a query is a factor in retrieval time.

Single-word queries have similar speeds in both systems.

Query          1     2     3     4     5     6     7     8     9    10    11    12    13    14
computer    3.25  3.25  4.44  4.81  5.02  5.29  5.52  5.74  5.98  6.27  6.50  6.74  6.93  7.14
longevity   0.82  0.81  0.90  0.89  0.89  0.89  0.90  0.91  0.90  0.90  0.90  0.90  0.90  0.95
test        2.18  2.16  2.61  2.71  2.98  3.11  3.27  3.40  3.61  3.79  3.99  4.20  4.38  4.48

Table 5 - Multiple-index (retrieval time in seconds by number of days in the range)

Query          1     2     3     4     5     6     7     8     9    10    11    12    13    14
computer    2.95  2.93  3.94  4.18  4.25  4.49  4.71  4.99  5.23  5.65  5.88  6.01  6.10  6.32
longevity   0.78  0.79  0.86  0.86  0.86  0.86  0.85  0.85  0.86  0.86  0.86  0.85  0.85  0.91
test        2.15  2.14  2.59  2.68  2.89  3.06  3.22  3.35  3.56  3.73  3.92  4.15  4.30  4.47

Table 6 - Temporal Index (retrieval time in seconds by number of days in the range)


Complexity of Query

Here are the times for the queries:
  A: curing cancer technology advancement
  B: harmful effects television children

Query       1      2      3      4      5      6      7      8      9     10     11     12     13     14
A        4.31   3.90   8.09   8.03   9.64  12.35  11.87  13.03  14.73  15.65  16.99  18.39  19.30  20.85
B        2.71   2.61   6.21   8.14  11.52  13.10  14.97  16.61  19.36  21.57  22.95  25.22  27.52  29.04

Table 7 - Multi-index

Query       1      2      3      4      5      6      7      8      9     10     11     12     13     14
A        3.22   3.09   4.64   4.93   5.29   5.82   6.66   6.73   7.31   7.57   8.81   8.37   8.76  10.02
B        2.37   2.26   4.03   4.80   6.25   7.19   8.13   9.12   9.09  10.09  11.14  12.09  13.15  14.50

Table 8 - Temporal Index


Retrieval Time over Reverse Ranges

Each query was tested over ranges anchored at the end of the archive: first the last day only, then the last two days, and so forth.

The average times of the two systems ran more nearly parallel than in the previous test.

Both systems apply a filter that checks whether a page is the most recent version, which causes extra database checks.

Our search actually becomes faster as the range increases in this test case.


Average Reverse Retrieval Time

Figure 11: Chart of average reverse retrieval time in seconds (axis 0.00 to 25.00) against the number of days in the range (1 to 14). Series: Temporal Index, Multi-Index.


Effectiveness of Retrieval

We conducted a test to verify that we corrected the retrieval error.

Test query: longevity, 27 March 2005 to 4 April 2005

Figure 12 - Original System


Effectiveness of Retrieval

Results from the modified systems.

We accurately find all documents.

Figure 13 - Fixed System


Effects of Update Rate

To determine the effect updating has on retrieval time, we split out the fast updating sites.

Fast updating sites had 2,143 pages.

Slow updating sites had 1,372 pages.

We tested the queries only on a fourteen-day range.
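The slides do not name the fast updating sites. As a consistency check only, the sketch below shows that the per-site totals of Table 3 reproduce the two page counts if sites 11 and 12 (www.fidelity.com and www.pbs.org) are taken as the fast group; that grouping is an inference from the totals, not something stated in the presentation.

```python
# Per-site totals from the "Total" row of Table 3, sites 1 through 12.
SITE_TOTALS = [36, 100, 14, 478, 33, 1, 515, 52, 65, 78, 304, 1839]

FAST_SITES = {11, 12}  # assumed grouping, chosen because it matches the totals

fast_pages = sum(n for site, n in enumerate(SITE_TOTALS, start=1) if site in FAST_SITES)
slow_pages = sum(n for site, n in enumerate(SITE_TOTALS, start=1) if site not in FAST_SITES)

print(fast_pages, slow_pages, fast_pages + slow_pages)
# prints: 2143 1372 3515
```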


Effects of Update Rate

Query                                   Fast updating sites   Slow updating sites
                                        Time (sec)            Time (sec)
computer                                 2.84                  6.99
longevity                                1.03                  0.87
test                                     2.80                  3.54
current news                            10.03                  8.63
philosophical arguments                  3.33                  8.42
pigeon hole                              1.43                  1.05
buy car cheap                            5.53                  1.63
lowest market rate                       4.36                  2.61
career intern positions                  8.29                  9.99
usa election voter turnout               4.77                 15.83
curing cancer technology advancement     8.23                  7.15
harmful effects television children     14.73                  4.15
Average Time                             5.61                  5.91

Table 9


Indexing Costs

Creating and maintaining a single index is an expensive process.

The temporal index must be rebuilt every day.

There is a significant cost in comparison to a small daily index that can be created and used without modification.


Index Build Times

Figure 14: Chart of index build time in minutes (axis 0 to 18) against the number of days (1 to 14). Series: Temporal Index, Multi-Index.


Index Space Costs

The temporal index uses less storage than the multiple-index system.

The temporal index Dictionary does not grow as quickly since many words are shared across documents collected on subsequent days.

The Postings files, however, are identical in size.
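Why the Dictionary shrinks while the Postings stay the same can be illustrated with a toy archive. The documents and words below are made up, the function names are invented, and the sketch counts records rather than bytes.

```python
def build(docs):
    """docs maps filename -> set of words; returns word -> list of filenames."""
    index = {}
    for name, words in docs.items():
        for word in sorted(words):
            index.setdefault(word, []).append(name)
    return index


def sizes(index):
    """Return (number of Dictionary records, number of Postings records)."""
    return len(index), sum(len(files) for files in index.values())


# Hypothetical two-day archive.
ARCHIVE = {
    "20050322": {"1.html": {"temporal", "search", "engine"}, "2.html": {"search", "index"}},
    "20050323": {"1.html": {"temporal", "search", "query"}},
}

# Multiple-index system: one index per day, sizes summed over the days.
per_day = [sizes(build(docs)) for docs in ARCHIVE.values()]
multi = (sum(d for d, _ in per_day), sum(p for _, p in per_day))

# Temporal index: one index over all versions, named <day>_<filename>.
merged = {f"{day}_{name}": words for day, docs in ARCHIVE.items() for name, words in docs.items()}
temporal = sizes(build(merged))

print("multi-index   :", multi)
print("temporal index:", temporal)
```

Words shared between days ("temporal" and "search" here) get one Dictionary record in the temporal index but one per day in the multiple-index system, while every word occurrence in a document version still needs exactly one Postings record in both.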


Comparison of Dictionary Size

Figure 15: Chart of Dictionary size in MB (axis 0.00 to 16.00) against the number of days (1 to 14). Series: Temporal Index Dictionary, Multi-Index Individual Dictionary, Multi-Index Total Dictionary.


Comparison of Postings Size

Figure 16: Chart of Postings size in MB (axis 0.00 to 45.00) against the number of days (1 to 14). Series: Temporal Index Postings, Multi-Index Individual Postings, Multi-Index Total Postings.


Conclusions

A search over a multiple-index system is accurate only if it starts at the beginning of the archive.

We have shown that temporal index retrieval times are faster than those of a multiple-index system.
  The decrease in time comes from only needing a single lookup in a Dictionary.
  The complexity of the query does affect retrieval.
  Searching from the end of the archive increases retrieval times, but the temporal index is still quicker.
  The update rate of a site has an impact on retrieval times, but is not the only dominant factor.


Conclusions

The tradeoff is the cost of building the temporal index every time new information is added.

This disadvantage is invisible to the user and costs only time and system resources.

The temporal index system also requires less space due to the single dictionary file.


Future Work on the Temporal Search Engine

Developing a method to incrementally build a temporal index would greatly improve the efficiency of indexing in the Temporal Search Engine.

The database backend could be extended to handle more information. With this more accurate information, improvements could be made to retrieval times.

Modify the use of diff with the spider to look for content changes instead of any change.


Future Work with the Temporal Search Engine

Look at using web servers to track version information instead of using a spider to map websites.

Examine the possibility of storing only the changes between documents instead of entire new documents, similar to RCS.

The Temporal Search Engine may be better suited to smaller sites that update less frequently.

Thoroughly test the effect of update rate on retrieval and index times.


Thank you for your time

Questions?