29 June 2005 EECS Department University of Kansas Improving Query Retrieval Times in the Temporal Search Engine By Ryan Sheahan Committee Chair: Dr. Susan Gauch Committee Member: Dr. Perry Alexander Committee Member: Dr. Nancy Kinnersley


29 June 2005 EECS Department University of Kansas

Improving Query Retrieval Times in the Temporal Search Engine

By Ryan Sheahan

Committee Chair: Dr. Susan Gauch
Committee Member: Dr. Perry Alexander
Committee Member: Dr. Nancy Kinnersley


Outline

- Motivation and Goals
- Related Work
- System Details
- Experiments and Results
- Conclusions
- Future Work


Motivation

Conventional search engines do not store old versions of websites.

By keeping a version history we can:
- Save the content of a page
- Answer questions about changes over time
- Track the evolution of web pages

The Temporal Search Engine accomplishes these tasks, but needs improvement.


Goals

Implement the Temporal Search Engine, correcting the logic error.

Modify the indexing to support temporal indexing.

Show the benefits during the retrieval phase of the modified project.


Related Work

- Temporal Knowledge
- Time Transaction Databases
- Source Code Control Systems
- Versioning Online Documents


Related Work

Defining temporal knowledge:
- Time points
- Time intervals

Time-Transaction Databases:
- Valid Time
- Transaction Time

Source Code Control Systems:
- SCCS
- RCS


Related Work

Versioning online documents:
- When to create new versions of documents?
- Edit-based or copy-based tracking?

Version control for online documents:
- Temporal stamps within documents
- Temporal tracking by servers


System Details

- System Overview
- Spider Functionality
- Database
- Indexing
- Retrieval
- Improvements
- Screenshots


System Overview

A search engine has 3 primary parts:
- The spider collects web pages.
- The indexer collates the information in the web pages into a searchable file.
- The retrieval aspect gives a user interface that allows searching of the index file.

The Temporal Search Engine also utilizes a database to track versions.


System Overview

Figure 1: System overview diagram. Components: Spider, Collected pages, Temporal Indexer, Indexed Files, Database, Query Engine, Web Browser. Labeled flows: Query & Range, Filenames, File Record, Results.


Spider Functionality

- The spider is run daily using WGET.
- When new pages are found, they are added to the database and stored.
- Previously collected pages are then compared to the stored version using diff:
  - Changed pages are added to the database and stored for indexing.
  - Unchanged pages are discarded.
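The keep-or-discard decision above can be sketched in a few lines. This is a minimal illustration, not the thesis code: the function name `store_if_changed` is hypothetical, and a byte-for-byte comparison stands in for the `diff` call the slides mention.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def store_if_changed(fetched: Path, stored: Optional[Path], dated_dir: Path) -> bool:
    """Store `fetched` under `dated_dir` if it is new or differs from `stored`.

    Returns True when a new version was stored, False when the page was discarded.
    """
    if stored is not None and fetched.read_bytes() == stored.read_bytes():
        return False  # unchanged page: discard
    dated_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(fetched, dated_dir / fetched.name)
    return True


# Small demonstration in a temporary directory (all file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    page = root / "91.html"
    page.write_text("<html>version one</html>")
    day1 = root / "20050322"
    first = store_if_changed(page, None, day1)                        # new page
    same = store_if_changed(page, day1 / "91.html", root / "20050323")  # unchanged
    page.write_text("<html>a second, edited version</html>")
    changed = store_if_changed(page, day1 / "91.html", root / "20050324")  # changed
    stored_days = sorted(p.name for p in root.iterdir() if p.is_dir())
```

In this run only the first and third calls store a file, so no directory is created for the day on which the page was unchanged.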


Database - MySQL

The database is used to keep a record of the collected pages

There are 3 fields for each record.

Table 1
Description                     Field           Datatype   Example
Uniform Resource Locator        URL             String     http://www.pbs.org/index.html
Date when this file was added   date_spidered   String     20050322
File name used in indexing      Filename        String     91.html
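A record layout like Table 1 can be tried out as follows. The slides name MySQL; this sketch uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server. The table name `pages`, the helper `latest_version`, and the second row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    " url TEXT NOT NULL,"            # Uniform Resource Locator
    " date_spidered TEXT NOT NULL,"  # YYYYMMDD, e.g. 20050322
    " filename TEXT NOT NULL)"       # file name used in indexing
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("http://www.pbs.org/index.html", "20050322", "91.html"),
        ("http://www.pbs.org/index.html", "20050329", "91.html"),  # hypothetical later version
    ],
)


def latest_version(url, end_date):
    """Return (date_spidered, filename) of the newest version on or before end_date."""
    return conn.execute(
        "SELECT date_spidered, filename FROM pages "
        "WHERE url = ? AND date_spidered <= ? "
        "ORDER BY date_spidered DESC LIMIT 1",
        (url, end_date),
    ).fetchone()
```

Because the dates are fixed-width YYYYMMDD strings, plain string comparison orders them chronologically, which is why the `String` datatype in Table 1 is sufficient for range checks.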


File System

The collected pages are stored in a publicly accessible directory.

This directory contains sub-directories named by year, month, and day, e.g. 20050322.

Each version is stored in a dated directory, based on its collection date.


Indexing

An index is an easily searchable file of the information in the archived web pages.

Pages are pre-processed to remove unnecessary information.

A list of the keywords found in each document is generated and stored.

A list of documents that each keyword was found in is stored in a separate file.


The Index

A Dictionary record has three parts:
- word
- number of documents the word occurs in
- offset in the Postings file

A Postings record has two parts:
- file name
- weight of the word in that file
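The two record types can be sketched as an in-memory inverted index. The weighting used here (relative term frequency) is an assumption, since the slides do not say how weights are computed, and `build_index` and `lookup` are hypothetical names. Offsets are zero-based in this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (file name, weight of the word in that file)


def build_index(docs: Dict[str, str]) -> Tuple[Dict[str, Tuple[int, int]], List[Posting]]:
    """Build a Dictionary (word -> (doc count, postings offset)) and a Postings list."""
    by_word: Dict[str, List[Posting]] = defaultdict(list)
    for filename, text in docs.items():
        words = text.lower().split()
        for word, count in Counter(words).items():
            by_word[word].append((filename, count / len(words)))  # assumed weighting

    dictionary: Dict[str, Tuple[int, int]] = {}
    postings: List[Posting] = []
    for word in sorted(by_word):
        dictionary[word] = (len(by_word[word]), len(postings))  # offset of first posting
        postings.extend(by_word[word])
    return dictionary, postings


def lookup(dictionary, postings, word: str) -> List[Posting]:
    """One Dictionary lookup, then a contiguous slice of the Postings list."""
    if word not in dictionary:
        return []
    doc_count, offset = dictionary[word]
    return postings[offset:offset + doc_count]


docs = {"1.html": "temporal search engine", "2.html": "temporal index temporal"}
dictionary, postings = build_index(docs)
```

Storing the document count next to the offset is what lets a lookup read exactly the right number of Postings records without scanning.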


The Index

The pilot Temporal Search Engine created a separate index for each day that was archived.

Dictionary File
Word       # of Docs   Offset
Temporal   3           2
• • •

Postings File (records numbered 1 to 7 in the figure)
Filename   Weight
54.html    0.0082
23.html    0.0043
119.html   0.0003
• • •

Figure 2


Index Directory Structure

Indexed_Pages
  20050322
    1.html
    2.html
    3.html
    • •
    Dictionary.txt
    Postings.txt
  20050323
    1.html
    2.html
    3.html
    • •
    Dictionary.txt
    Postings.txt
  2005XXXX
    1.html
    2.html
    3.html
    • •
    Dictionary.txt
    Postings.txt
  • • •

Figure 3

Since the original system only searches files in the user specified range, results can be missed.


Retrieval

A user’s query is looked up quickly in a Dictionary file, because the Dictionary is a hash table.

The Postings file shows us the associated documents for the user’s query for a specific day.

To return a page to a user, we find which day it was archived and display the appropriate page.


Retrieval Error

Each day’s index includes only the pages that were modified that day, so older, unchanged pages do not appear in it.

As a result, pages that did not change within the user-specified range are not shown.


Retrieval Error

Query: cat
Start Date: 2005 03 23
End Date: 2005 03 24

Dictionary and Postings entries for "cat" in the daily indexes:
2005 03 22 Index: cat -> 72.html 34.html 10.html 19.html
2005 03 23 and 2005 03 24 Indexes: cat -> 72.html 10.html; cat -> 72.html 14.html

Only 2005 03 23 and 2005 03 24 would be accessed. Pages 34.html and 19.html would not be returned, even though they should be.

Figure 4


Fixing Retrieval

Although the user may not notice this error, it is a fairly serious flaw in the system design.

To fix it, retrieval must loop over the entire archive, from the beginning up to the user-entered end date.

This corrected multiple-index system is the baseline against which we will compare our improvements.
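The difference between the flawed and the corrected loop can be sketched as below. The data is illustrative: the file names echo Figure 4, but which of the two later days holds which postings list is an assumption, and both function names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical per-day indexes: date -> word -> files indexed (new or changed) that day.
DAILY_INDEXES: Dict[str, Dict[str, List[str]]] = {
    "20050322": {"cat": ["72.html", "34.html", "10.html", "19.html"]},
    "20050323": {"cat": ["72.html", "10.html"]},  # assumed assignment
    "20050324": {"cat": ["72.html", "14.html"]},  # assumed assignment
}


def range_only_search(word: str, start: str, end: str) -> Set[str]:
    """Original behaviour: consult only the indexes inside [start, end]."""
    hits: Set[str] = set()
    for day in sorted(DAILY_INDEXES):
        if start <= day <= end:
            hits.update(DAILY_INDEXES[day].get(word, []))
    return hits


def search_from_beginning(word: str, end: str) -> Dict[str, str]:
    """Corrected behaviour: walk every index from the first day up to `end`.

    Returns each matching file with the date of its latest matching version.
    """
    latest: Dict[str, str] = {}
    for day in sorted(DAILY_INDEXES):
        if day > end:
            break
        for filename in DAILY_INDEXES[day].get(word, []):
            latest[filename] = day  # later days overwrite earlier versions
    return latest


flawed = range_only_search("cat", "20050323", "20050324")
fixed = search_from_beginning("cat", "20050324")
missed = set(fixed) - flawed
```

The sketch leaves out the check for pages whose newest version no longer matches the query; a later slide describes that most-recent-version filter as a database lookup in both systems.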


Additional Features

Users can review all versions of a document.

They can view changes between two documents.

Users can sort results by date or relevance.


Improvements

Create a single, temporal index that contains all files.

A directory name and a filename together form a unique identifier for each file.

The temporal index simplifies the retrieval process, since we do not need to loop over several dictionary files.


Temporal Index Retrieval

A single lookup in the Dictionary file is needed.

Then parse the records from the Postings file to get the archival date and the filename.

Using the date we can filter files that are in the user’s specified range.

Filename            Weight
20050322_54.html    0.0042
20050329_54.html    0.0033
20050404_119.html   0.0029
20050327_15.html    0.0012
• • •

Figure 5
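The parse-and-filter step can be sketched with the records shown in Figure 5. The function names and the example range are assumptions; the record format (`YYYYMMDD_filename weight`) is taken from the figure.

```python
from typing import List, Tuple

# Postings records for one word, copied from Figure 5.
POSTINGS = [
    "20050322_54.html 0.0042",
    "20050329_54.html 0.0033",
    "20050404_119.html 0.0029",
    "20050327_15.html 0.0012",
]


def parse_posting(record: str) -> Tuple[str, str, float]:
    """Split 'YYYYMMDD_name.html weight' into (archival date, file name, weight)."""
    identifier, weight = record.split()
    date, filename = identifier.split("_", 1)
    return date, filename, float(weight)


def filter_by_range(records: List[str], start: str, end: str) -> List[Tuple[str, str, float]]:
    """Keep only the postings whose archival date lies in [start, end]."""
    parsed = (parse_posting(r) for r in records)
    return [p for p in parsed if start <= p[0] <= end]


hits = filter_by_range(POSTINGS, "20050325", "20050331")
```

Since the date is part of the identifier, the range filter is a string comparison per record and needs no further Dictionary lookups.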


Query Screen

Figure 6


Results Screen

Figure 7


All Versions Screen

Figure 8


File Comparison Screen

Figure 9


Experiments and Results

- Data Set
- Test Cases
- Retrieval Improvements
- Indexing Costs


Data Set

Test data was gathered from the following URLs. The websites were tracked for 14 days.

Table 2
1. www.ittc.ku.edu
2. www.kuhistory.com
3. www.career.engr.ku.edu
4. www.jocoelection.org
5. www.eecs.ku.edu
6. jobs.ku.edu
7. www.research.ku.edu
8. www.engr.ku.edu
9. www.kslegislature.org
10. www.cartoonnetwork.com
11. www.fidelity.com
12. www.pbs.org


Pages Collected Per Day

Table 3: Pages collected per day (rows) for each site (columns, numbered as in Table 2)

Day/Site    1    2    3    4    5    6    7    8    9   10   11    12
03 22      22   74    1  428   30    1  472    0    0    0    0     0
03 23       1    0    1    1    1    0    2    0    0    0    0     0
03 24       1    2    1   25    2    0    7   52   40   52   66   256
03 25       1    0    1    3    0    0    3    0   16    6    3   100
03 26       1    1    1    1    0    0    5    0    1    3   23   180
03 27       1    0    1    1    0    0    0    0    1    3   23   137
03 28       1    0    1    1    0    0    0    0    0    1   23   168
03 29       1    0    1    5    0    0    3    0    2    7   26   108
03 30       1    0    1    3    0    0    4    0    1    1   23   174
03 31       1    2    1    6    0    0    5    0    1    1   24   139
04 01       1    1    1    1    0    0    8    0    1    1   23   154
04 02       2    0    1    1    0    0    5    0    0    1   24   156
04 03       1    0    1    1    0    0    1    0    1    1   23   122
04 04       1   20    1    1    0    0    0    0    1    1   23   145
Total      36  100   14  478   33    1  515   52   65   78  304  1839


Test Cases

Twelve queries were run over date ranges of varying length.

Queries contained between one and four words.

Table 4
One Word:    computer; longevity; test
Two Word:    current news; philosophical arguments; pigeon hole
Three Word:  buy car cheap; lowest market rate; career intern positions
Four Word:   usa election voter turnout; curing cancer technology advancement; harmful effects television children


Test Cases

Each query was tested over a range, starting at just the first day in the archive and expanding to include all 14 days.

The average retrieval time of the multiple-index system peaked at 12.71 seconds.

The average retrieval time of the temporal index system peaked at 7.51 seconds.
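To put the two peak averages side by side, the relative improvement can be worked out directly. This is only arithmetic on the two figures quoted above; the variable names are illustrative.

```python
multi_index_peak = 12.71     # seconds, peak average for the multiple-index system
temporal_index_peak = 7.51   # seconds, peak average for the temporal index system

# Fraction of the multiple-index time that the temporal index saves.
reduction = (multi_index_peak - temporal_index_peak) / multi_index_peak

# How many times faster the temporal index is at the peak.
speedup = multi_index_peak / temporal_index_peak

print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

At the peak, the temporal index takes roughly 41% less time, or about 1.7 times faster retrieval.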


Average Retrieval Time

Figure 10: Chart of average retrieval time. X-axis: Number of Days (1 to 14). Y-axis: Time (sec), 0.00 to 14.00. Series: Temporal Index, Multi-Index.


Complexity of Query

The complexity of a query is a factor in retrieval time. Single-word queries have similar speeds in both systems.

Table 5 - Multiple-index (retrieval time in seconds; columns are the number of days in the range)
Query         1     2     3     4     5     6     7     8     9    10    11    12    13    14
computer   3.25  3.25  4.44  4.81  5.02  5.29  5.52  5.74  5.98  6.27  6.50  6.74  6.93  7.14
longevity  0.82  0.81  0.90  0.89  0.89  0.89  0.90  0.91  0.90  0.90  0.90  0.90  0.90  0.95
test       2.18  2.16  2.61  2.71  2.98  3.11  3.27  3.40  3.61  3.79  3.99  4.20  4.38  4.48

Table 6 - Temporal Index (retrieval time in seconds; columns are the number of days in the range)
Query         1     2     3     4     5     6     7     8     9    10    11    12    13    14
computer   2.95  2.93  3.94  4.18  4.25  4.49  4.71  4.99  5.23  5.65  5.88  6.01  6.10  6.32
longevity  0.78  0.79  0.86  0.86  0.86  0.86  0.85  0.85  0.86  0.86  0.86  0.85  0.85  0.91
test       2.15  2.14  2.59  2.68  2.89  3.06  3.22  3.35  3.56  3.73  3.92  4.15  4.30  4.47


Complexity of Query

Here are the times for the queries:
(1) curing cancer technology advancement
(2) harmful effects television children

Table 7 - Multi-index (rows in the order the queries are listed above; columns are the number of days in the range)
Query      1     2     3     4      5      6      7      8      9     10     11     12     13     14
(1)     4.31  3.90  8.09  8.03   9.64  12.35  11.87  13.03  14.73  15.65  16.99  18.39  19.30  20.85
(2)     2.71  2.61  6.21  8.14  11.52  13.10  14.97  16.61  19.36  21.57  22.95  25.22  27.52  29.04

Table 8 - Temporal Index (rows in the order the queries are listed above; columns are the number of days in the range)
Query      1     2     3     4      5      6      7      8      9     10     11     12     13     14
(1)     3.22  3.09  4.64  4.93   5.29   5.82   6.66   6.73   7.31   7.57   8.81   8.37   8.76  10.02
(2)     2.37  2.26  4.03  4.80   6.25   7.19   8.13   9.12   9.09  10.09  11.14  12.09  13.15  14.50


Retrieval Time over Reverse Ranges

Each query was tested over just the last day of the archive, then over the last two days, and so forth.

The average times of the two systems tracked each other more closely than in the previous test.

In both systems, a filter checks whether a page is the most recent version, which causes extra database checks.

Our search actually becomes faster as the range increases in this test case.


Average Reverse Retrieval Time

Figure 11: Chart of average reverse retrieval time. X-axis: Number of Days (1 to 14). Y-axis: Time (sec), 0.00 to 25.00. Series: Temporal Index, Multi-Index.


Effectiveness of Retrieval

We conducted a test to verify that we corrected the retrieval error.

Test query: Longevity, 27 March 2005 to 4 April 2005

Figure 12 - Original System


Effectiveness of Retrieval

Results from the modified systems.

We accurately find all documents.

Figure 13 - Fixed System


Effects of Update Rate

To determine the effect updating has on retrieval time, we split out the fast updating sites.

- Fast updating sites had 2,143 pages.
- Slow updating sites had 1,372 pages.
- We tested the queries only on a fourteen day range.


Effects of Update Rate

Table 9
Query                                   Fast updating sites   Slow updating sites
                                        Time (sec)            Time (sec)
computer                                 2.84                  6.99
longevity                                1.03                  0.87
test                                     2.80                  3.54
current news                            10.03                  8.63
philosophical arguments                  3.33                  8.42
pigeon hole                              1.43                  1.05
buy car cheap                            5.53                  1.63
lowest market rate                       4.36                  2.61
career intern positions                  8.29                  9.99
usa election voter turnout               4.77                 15.83
curing cancer technology advancement     8.23                  7.15
harmful effects television children     14.73                  4.15
Average Time                             5.61                  5.91


Indexing Costs

Creating and maintaining a single index is an expensive process.

The temporal index must be rebuilt every day.

There is a significant cost in comparison to a small daily index that can be created and used without modification.


Index Build Times

Figure 14: Chart of index build times. X-axis: Number of Days (1 to 14). Y-axis: Time (min), 0 to 18. Series: Temporal Index, Multi-Index.


Index Space Costs

The temporal index uses less storage than the multiple-index system.

The temporal index Dictionary does not grow as quickly since many words are shared across documents collected on subsequent days.

However, the Postings files are identical in size.
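A toy count shows why the Dictionary shrinks while the Postings do not. The documents and words below are made up for illustration; only the counting argument comes from the slides.

```python
# Hypothetical archive: date -> file name -> set of distinct words in that file.
daily_docs = {
    "20050322": {"1.html": {"temporal", "search"}, "2.html": {"search", "engine"}},
    "20050323": {"1.html": {"temporal", "index"}},
}


def vocabulary(docs):
    """All distinct words that occur in one day's documents."""
    return set().union(*docs.values())


# Multiple-index system: one Dictionary per day, so shared words are stored repeatedly.
multi_dictionary_entries = sum(len(vocabulary(docs)) for docs in daily_docs.values())

# Temporal index: a single Dictionary over the union of all days.
temporal_dictionary_entries = len(
    set().union(*(vocabulary(docs) for docs in daily_docs.values()))
)

# Postings: one record per (word, file version) pair in either layout.
multi_postings = sum(len(words) for docs in daily_docs.values() for words in docs.values())
merged = {
    f"{date}_{name}": words
    for date, docs in daily_docs.items()
    for name, words in docs.items()
}
temporal_postings = sum(len(words) for words in merged.values())
```

Here the word "temporal" appears on both days, so it costs two Dictionary entries in the multiple-index layout but only one in the temporal index, while every posting is still needed exactly once.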


Comparison of Dictionary Size

Figure 15: Chart comparing Dictionary size. X-axis: Number of Days (1 to 14). Y-axis: Size (MB), 0.00 to 16.00. Series: Temporal Index Dictionary, Multi-Index Individual Dictionary, Multi-Index Total Dictionary.


Comparison of Postings Size

Figure 16: Chart comparing Postings size. X-axis: Number of Days (1 to 14). Y-axis: Size (MB), 0.00 to 45.00. Series: Temporal Index Postings, Multi-Index Individual Postings, Multi-Index Total Postings.


Conclusions

The only accurate search over a multiple-index system is one that starts at the beginning of the archive.

We have shown that temporal index retrieval times are faster than those of a multiple-index system.
- The decrease in time comes from needing only a single lookup in a Dictionary.
- The complexity of the query does affect retrieval time.
- Searching from the end of the archive increases retrieval times, but the temporal index is still quicker.
- The update rate of a site has an impact on retrieval times, but is not the only dominant factor.


Conclusions

The tradeoff is the cost of building the temporal index every time new information is added.

This disadvantage is not visible to the user; its only cost is time and system resources.

The temporal index system also requires less space due to the single dictionary file.


Future Work on the Temporal Search Engine

Developing a method to incrementally build a temporal index would greatly improve the efficiency of indexing in the Temporal Search Engine.

The database backend could be extended to handle more information. With this more accurate information, improvements could be made to retrieval times.

Modify the use of diff with the spider to look for content changes instead of any change.
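One possible way to look for content changes rather than any change is to compare only the visible text of two versions. The sketch below is an assumption about how that could work, not the method used in the thesis: it uses Python's standard `html.parser`, ignores tags, attributes, comments, scripts and styles, and normalizes whitespace. The names `visible_text` and `content_changed` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the page's text with markup removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def content_changed(old_html: str, new_html: str) -> bool:
    """True only when the visible text differs between the two versions."""
    return visible_text(old_html) != visible_text(new_html)


old = "<html><body><p>Hello  world</p><!-- built 2005-03-22 --></body></html>"
cosmetic = '<html><body>\n<p class="x">Hello world</p><!-- built 2005-03-23 --></body></html>'
edited = "<html><body><p>Hello there</p></body></html>"
```

With a check like this, a page whose only difference is a build timestamp in a comment or a changed attribute would not be stored as a new version.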


Future Work with the Temporal Search Engine

Look at using web servers to track version information instead of using a spider to map websites.

Examine the possibility of storing only the changes between documents instead of entire new documents, similar to RCS.

The Temporal Search Engine may be better suited to smaller sites that update less frequently.
- Thoroughly test the effect of update rate on retrieval and index times.
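The idea of storing only the changes between versions can be sketched with Python's standard `difflib`. This is an illustration of the general delta idea only; it does not reproduce the RCS file format, and `make_delta` and `apply_delta` are hypothetical names.

```python
import difflib
from typing import List, Tuple

Delta = List[Tuple[int, int, List[str]]]  # (start in old, end in old, replacement lines)


def make_delta(old: List[str], new: List[str]) -> Delta:
    """Record only the regions of `old` that must be replaced to obtain `new`."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old: List[str], delta: Delta) -> List[str]:
    """Rebuild the newer version from the older one plus the stored delta."""
    result: List[str] = []
    position = 0
    for start, end, replacement in delta:
        result.extend(old[position:start])  # unchanged lines before this edit
        result.extend(replacement)
        position = end
    result.extend(old[position:])           # unchanged tail
    return result


old_version = ["<h1>News</h1>", "<p>Monday</p>", "<p>Footer</p>"]
new_version = ["<h1>News</h1>", "<p>Tuesday</p>", "<p>Extra</p>", "<p>Footer</p>"]
delta = make_delta(old_version, new_version)
rebuilt = apply_delta(old_version, delta)
```

The trade-off is that displaying an old version then requires applying deltas rather than reading a file directly, which would have to be weighed against the retrieval times measured above.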


Thank you for your time

Questions?