29 June 2005 EECS Department University of Kansas
Improving Query Retrieval Times in the Temporal Search Engine
By Ryan Sheahan
Committee Chair: Dr. Susan Gauch
Committee Member: Dr. Perry Alexander
Committee Member: Dr. Nancy Kinnersley
University of Kansas Ryan Sheahan 2
Outline
Motivation and Goals
Related Work
System Details
Experiments and Results
Conclusions
Future Work
Motivation
Conventional search engines do not store old versions of websites.
By keeping a version history we can:
  Save the content of a page
  Answer questions about changes over time
  Track the evolution of web pages
The Temporal Search Engine accomplishes these tasks, but needs improvement.
Goals
Implement the Temporal Search Engine, correcting the logic error.
Modify the indexing to support temporal indexing.
Show the benefits during the retrieval phase of the modified project.
Related Work
Temporal Knowledge
Time Transaction Databases
Source Code Control Systems
Versioning Online Documents
Related Work
Defining temporal knowledge:
  Time points
  Time intervals
Time-Transaction Databases
  Valid Time
  Transaction Time
Source Code Control Systems
  SCCS
  RCS
Related Work
Versioning online documents
  When to create new versions of documents?
  Edit-based or copy-based tracking?
Version control for online documents
  Temporal stamps within documents
  Temporal tracking by servers
System Details
System Overview
Spider Functionality
Database
Indexing
Retrieval
Improvements
Screenshots
System Overview
A search engine has 3 primary parts:
  The spider collects web pages.
  The indexer collates the information in the web pages into a searchable file.
  The retrieval aspect gives a user interface that allows searching of the index file.
The Temporal Search Engine also utilizes a database to track versions.
System Overview
Figure 1: System overview diagram. Components: Spider, Temporal Indexer, Indexed Files, Database, Query Engine, Web Browser. Labeled flows: Collected pages, File Record, Filenames, Query & Range, Results.
Spider Functionality
The spider is run daily using Wget.
When new pages are found, they are added to the database and stored.
Previously collected pages are compared to the stored version using diff:
  Changed pages are added to the database and stored for indexing.
  Unchanged pages are discarded.
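The new/changed/unchanged decision above can be sketched as follows. This is a minimal illustration, not the project's spider: it uses Python's difflib in place of the diff command, and the function name classify_page and the in-memory archive dictionary are hypothetical.

```python
import difflib


def classify_page(url, new_text, stored_versions):
    """Return 'new', 'changed', or 'unchanged' for a freshly fetched page.

    stored_versions maps URL -> text of the most recently stored version
    (a hypothetical in-memory stand-in for the archive on disk).
    """
    if url not in stored_versions:
        stored_versions[url] = new_text
        return "new"
    old_lines = stored_versions[url].splitlines()
    new_lines = new_text.splitlines()
    # unified_diff yields nothing when the two line sequences are identical.
    delta = list(difflib.unified_diff(old_lines, new_lines, lineterm=""))
    if delta:
        stored_versions[url] = new_text  # keep the changed page for indexing
        return "changed"
    return "unchanged"  # discard the fetched copy


archive = {}
url = "http://www.pbs.org/index.html"
first = classify_page(url, "<p>hello</p>", archive)
second = classify_page(url, "<p>hello</p>", archive)
third = classify_page(url, "<p>hello again</p>", archive)
print(first, second, third)
```

Like diff, this treats any textual difference as a change, including differences that do not affect the page's content.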
Database - MySQL
The database is used to keep a record of the collected pages.
There are 3 fields for each record.

Description                    Field          Datatype  Example
Uniform Resource Locator       URL            String    http://www.pbs.org/index.html
Date when this file was added  date_spidered  String    20050322
File name used in indexing     Filename       String    91.html

Table 1
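A record with these three fields might be stored and queried roughly as below. This is a sketch under assumptions: an in-memory SQLite database stands in for MySQL so the example runs without a server, and the table name pages, the helper versions_of, and the second inserted row are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (           -- table name is hypothetical
        url           TEXT NOT NULL,
        date_spidered TEXT NOT NULL,
        filename      TEXT NOT NULL
    )
    """
)

url = "http://www.pbs.org/index.html"
rows = [
    (url, "20050322", "91.html"),  # the example record from Table 1
    (url, "20050324", "91.html"),  # hypothetical later version of the same page
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)


def versions_of(conn, url):
    """Return (date_spidered, filename) for every stored version, oldest first."""
    cursor = conn.execute(
        "SELECT date_spidered, filename FROM pages "
        "WHERE url = ? ORDER BY date_spidered",
        (url,),
    )
    return cursor.fetchall()


history = versions_of(conn, url)
print(history)
```

Storing the date as a YYYYMMDD string keeps alphabetical order and chronological order the same, so a plain ORDER BY is enough.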
File System
The collected pages are stored in a publicly accessible directory.
This directory contains sub-directories named by year, month, and day (e.g., 20050322).
Each version is stored in a dated directory, based on its collection date.
Indexing
An index is an easily searchable file of the information in the archived web pages.
Pages are pre-processed to remove unnecessary information.
For each document, a list of the keywords it contains is generated and stored.
For each keyword, a list of the documents it was found in is stored in a separate file.
The Index
A Dictionary record has three parts:
  word
  number of documents the word occurs in
  offset in the Postings file
A Postings record has two parts:
  file name
  weight of the word in that file
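The two record types can be illustrated with a small in-memory index. This is only a sketch: the weight formula (term frequency divided by document length) is an assumption, since the slides do not define the weight, and offsets here are list positions rather than file offsets. The names build_index and lookup are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Build (dictionary, postings) from {filename: text}.

    dictionary: word -> (number of documents, offset into postings)
    postings:   flat list of (filename, weight) records
    """
    per_word = defaultdict(list)
    for filename, text in sorted(documents.items()):
        tokens = text.lower().split()
        for word, count in Counter(tokens).items():
            # Assumed weight: term frequency normalised by document length.
            per_word[word].append((filename, count / len(tokens)))

    dictionary, postings = {}, []
    for word in sorted(per_word):
        # The offset is where this word's run of records starts in postings.
        dictionary[word] = (len(per_word[word]), len(postings))
        postings.extend(per_word[word])
    return dictionary, postings


def lookup(word, dictionary, postings):
    """Return the postings records for one word (empty list if absent)."""
    if word not in dictionary:
        return []
    doc_count, offset = dictionary[word]
    return postings[offset:offset + doc_count]


docs = {
    "54.html": "temporal search engine",
    "23.html": "temporal index",
    "119.html": "web search",
}
dictionary, postings = build_index(docs)
hits = lookup("temporal", dictionary, postings)
print(dictionary["temporal"], hits)
```

The document count and offset together identify a contiguous slice of the Postings data, which is why one Dictionary lookup is enough to find every record for a word.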
The Index
The pilot Temporal Search Engine created a separate index for each day that was archived.
Dictionary File
Word      # of Docs  Offset
Temporal  3          2
...

Postings File (rows numbered 1 to 7 in the figure)
Filename  Weight
54.html   0.0082
23.html   0.0043
119.html  0.0003
...

Figure 2
Index Directory Structure
Indexed_Pages
  20050322
    1.html, 2.html, 3.html, ...
    Dictionary.txt
    Postings.txt
  20050323
    1.html, 2.html, 3.html, ...
    Dictionary.txt
    Postings.txt
  ...
  2005XXXX
    1.html, 2.html, 3.html, ...
    Dictionary.txt
    Postings.txt
Since the original system only searches files in the user specified range, results can be missed.
Figure 3
Retrieval
A user’s query is quickly looked up in a Dictionary file since it is a hash table.
The Postings file shows us the associated documents for the user’s query for a specific day.
To return a page to a user, we find which day it was archived and display the appropriate page.
Retrieval Error
Each day's index only includes pages that were modified that day, so older unchanged pages do not appear in it.
Pages that do not change within the user-specified range will not be shown.
Retrieval Error
Three daily indexes, each with a Dictionary (Dict) entry for "cat" and its Postings (Post):
  2005 03 22 Index: cat -> 72.html, 34.html, 10.html, 19.html
  2005 03 23 and 2005 03 24 Indexes: cat -> 72.html, 10.html and cat -> 72.html, 14.html
Query: cat   Start Date: 2005 03 23   End Date: 2005 03 24
Only 2005 03 23 and 2005 03 24 would be accessed. Pages 34.html and 19.html would not be returned, even though they should be.
Figure 4
Fixing Retrieval
Although the user may not notice this error, it is a fairly serious flaw in the system design.
To fix it, we must loop over the entire archive, from the beginning up to the user-entered end date.
This is the base system against which we will compare our improvements.
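The error and its fix can be reproduced with the "cat" example from Figure 4. This is a sketch, not the system's code: the function names are hypothetical, and the assignment of the two shorter postings lists to 2005 03 23 and 2005 03 24 is an assumption. It also ignores the case where a later version of a page no longer contains the query word.

```python
daily_index = {
    "20050322": {"cat": ["72.html", "34.html", "10.html", "19.html"]},
    "20050323": {"cat": ["72.html", "10.html"]},  # day assignment assumed
    "20050324": {"cat": ["72.html", "14.html"]},  # day assignment assumed
}


def search_range_only(word, start, end):
    """Original behaviour: consult only the indexes dated within [start, end]."""
    found = set()
    for day in sorted(daily_index):
        if start <= day <= end:
            found.update(daily_index[day].get(word, []))
    return found


def search_from_archive_start(word, end):
    """Fixed behaviour: scan every index from the first day up to the end date.

    Later days overwrite earlier ones, so each page maps to its newest version.
    """
    newest = {}
    for day in sorted(daily_index):
        if day <= end:
            for filename in daily_index[day].get(word, []):
                newest[filename] = day
    return newest


original = search_range_only("cat", "20050323", "20050324")
fixed = search_from_archive_start("cat", "20050324")
missed = set(fixed) - original
print(sorted(missed))
```

The fixed search visits one Dictionary per archived day, so its cost grows with the age of the archive rather than with the width of the requested range.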
Additional Features
Users can review all versions of a document.
They can view changes between two documents.
Users can sort results by date or relevance.
Improvements
Create a single, temporal index that contains all files.
Combining a directory name and a filename creates a unique identifier for each file.
The temporal index simplifies the retrieval process, since we do not need to loop over several dictionary files.
Temporal Index Retrieval
A single lookup in the Dictionary file is needed.
Then parse the records from the Postings file to get the archival date and the filename.
Using the date we can filter files that are in the user’s specified range.
Filename           Weight
20050322_54.html   0.0042
20050329_54.html   0.0033
20050404_119.html  0.0029
20050327_15.html   0.0012
...
Figure 5
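The parse-and-filter step can be sketched with the Postings records shown in Figure 5. The function names are hypothetical, and the filter follows the slide literally: it keeps records whose archival date falls inside the range and does not model how versions archived before the start date are handled.

```python
def parse_posting_name(name):
    """Split '20050322_54.html' into ('20050322', '54.html')."""
    date, filename = name.split("_", 1)
    return date, filename


def filter_by_range(postings, start, end):
    """Keep postings whose archival date lies within [start, end]."""
    results = []
    for name, weight in postings:
        date, filename = parse_posting_name(name)
        # YYYYMMDD strings compare in chronological order.
        if start <= date <= end:
            results.append((date, filename, weight))
    return results


postings = [
    ("20050322_54.html", 0.0042),
    ("20050329_54.html", 0.0033),
    ("20050404_119.html", 0.0029),
    ("20050327_15.html", 0.0012),
]
in_range = filter_by_range(postings, "20050325", "20050331")
print(in_range)
```

Because the date is part of the identifier, the filter needs no database access and no additional Dictionary lookups.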
Query Screen
Figure 6
Results Screen
Figure 7
All Versions Screen
Figure 8
File Comparison Screen
Figure 9
Experiments and Results
Data Set
Test Cases
Retrieval Improvements
Indexing Costs
Data Set
Test data was gathered from the following URLs.
The websites were tracked for 14 days.

1. www.ittc.ku.edu          7. www.research.ku.edu
2. www.kuhistory.com        8. www.engr.ku.edu
3. www.career.engr.ku.edu   9. www.kslegislature.org
4. www.jocoelection.org    10. www.cartoonnetwork.com
5. www.eecs.ku.edu         11. www.fidelity.com
6. jobs.ku.edu             12. www.pbs.org

Table 2
Pages Collected Per Day
Day/Site 1 2 3 4 5 6 7 8 9 10 11 12
03 22 22 74 1 428 30 1 472 0 0 0 0 0
03 23 1 0 1 1 1 0 2 0 0 0 0 0
03 24 1 2 1 25 2 0 7 52 40 52 66 256
03 25 1 0 1 3 0 0 3 0 16 6 3 100
03 26 1 1 1 1 0 0 5 0 1 3 23 180
03 27 1 0 1 1 0 0 0 0 1 3 23 137
03 28 1 0 1 1 0 0 0 0 0 1 23 168
03 29 1 0 1 5 0 0 3 0 2 7 26 108
03 30 1 0 1 3 0 0 4 0 1 1 23 174
03 31 1 2 1 6 0 0 5 0 1 1 24 139
04 01 1 1 1 1 0 0 8 0 1 1 23 154
04 02 2 0 1 1 0 0 5 0 0 1 24 156
04 03 1 0 1 1 0 0 1 0 1 1 23 122
04 04 1 20 1 1 0 0 0 0 1 1 23 145
Total 36 100 14 478 33 1 515 52 65 78 304 1839
Table 3
Test Cases
12 queries were used over a variable range of days.
Queries contained between one and four words.
One Word   Two Word                 Three Word               Four Word
computer   current news             buy car cheap            usa election voter turnout
longevity  philosophical arguments  lowest market rate       curing cancer technology advancement
test       pigeon hole              career intern positions  harmful effects television children

Table 4
Test Cases
Each query was tested over a range, starting at just the first day in the archive and expanding to include all 14 days.
The average retrieval time for the multiple-index system was 12.71 seconds at its peak.
The highest average retrieval time of the temporal index system was 7.51 seconds.
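The size of the gap between the two peak averages can be worked out directly from the figures above. The variable names are illustrative only.

```python
multi_index_peak = 12.71  # seconds, peak average for the multiple-index system
temporal_peak = 7.51      # seconds, peak average for the temporal index system

saving = multi_index_peak - temporal_peak
reduction_pct = 100 * saving / multi_index_peak
speedup = multi_index_peak / temporal_peak

print(f"{saving:.2f} s saved, {reduction_pct:.1f}% lower, {speedup:.2f}x faster")
```

At its peak, the temporal index average is about 41% lower than the multiple-index average.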
Average Retrieval Time
Figure 10: chart of Time (sec) against Number of Days (1 to 14) for the Temporal Index and Multi-Index systems.
Complexity of Query
The complexity of queries is a factor in retrieval time.
Single-word queries have similar speeds.

Table 5 - Multiple-index (retrieval time in seconds, by number of days)
Query      1     2     3     4     5     6     7     8     9     10    11    12    13    14
computer   3.25  3.25  4.44  4.81  5.02  5.29  5.52  5.74  5.98  6.27  6.50  6.74  6.93  7.14
longevity  0.82  0.81  0.90  0.89  0.89  0.89  0.90  0.91  0.90  0.90  0.90  0.90  0.90  0.95
test       2.18  2.16  2.61  2.71  2.98  3.11  3.27  3.40  3.61  3.79  3.99  4.20  4.38  4.48

Table 6 - Temporal Index (retrieval time in seconds, by number of days)
Query      1     2     3     4     5     6     7     8     9     10    11    12    13    14
computer   2.95  2.93  3.94  4.18  4.25  4.49  4.71  4.99  5.23  5.65  5.88  6.01  6.10  6.32
longevity  0.78  0.79  0.86  0.86  0.86  0.86  0.85  0.85  0.86  0.86  0.86  0.85  0.85  0.91
test       2.15  2.14  2.59  2.68  2.89  3.06  3.22  3.35  3.56  3.73  3.92  4.15  4.30  4.47
Complexity of Query
Here are the times for the queries:
  1. curing cancer technology advancement
  2. harmful effects television children

Table 7 - Multi-index (retrieval time in seconds, by number of days)
Query  1     2     3     4     5      6      7      8      9      10     11     12     13     14
1      4.31  3.90  8.09  8.03  9.64   12.35  11.87  13.03  14.73  15.65  16.99  18.39  19.30  20.85
2      2.71  2.61  6.21  8.14  11.52  13.10  14.97  16.61  19.36  21.57  22.95  25.22  27.52  29.04

Table 8 - Temporal Index (retrieval time in seconds, by number of days)
Query  1     2     3     4     5     6     7     8     9     10     11     12     13     14
1      3.22  3.09  4.64  4.93  5.29  5.82  6.66  6.73  7.31  7.57   8.81   8.37   8.76   10.02
2      2.37  2.26  4.03  4.80  6.25  7.19  8.13  9.12  9.09  10.09  11.14  12.09  13.15  14.50
Retrieval Time over Reverse Ranges
Test each query starting from the last day of the archive, then the last two days of the archive, and so forth.
The average times were more parallel than in the previous test.
Both systems include a filter that checks whether a page is the most recent version, which causes extra database checks.
Our search actually becomes faster as the range increases in this test case.
Average Reverse Retrieval Time
Figure 11: chart of Time (sec) against Number of Days (1 to 14) for the Temporal Index and Multi-Index systems.
Effectiveness of Retrieval
We conducted a test to confirm that we corrected the retrieval error.
Test query: Longevity, 27 March 2005 to 4 April 2005
Figure 12 - Original System
Effectiveness of Retrieval
Results from the modified systems.
We accurately find all documents.
Figure 13 - Fixed System
Effects of Update Rate
To determine the effect updating has on retrieval time, we split out the fast-updating sites.
  Fast-updating sites had 2,143 pages.
  Slow-updating sites had 1,372 pages.
  We tested the queries only on a fourteen-day range.
Effects of Update Rate
Query                                 Fast-updating sites  Slow-updating sites
                                      Time (sec)           Time (sec)
computer                              2.84                 6.99
longevity                             1.03                 0.87
test                                  2.80                 3.54
current news                          10.03                8.63
philosophical arguments               3.33                 8.42
pigeon hole                           1.43                 1.05
buy car cheap                         5.53                 1.63
lowest market rate                    4.36                 2.61
career intern positions               8.29                 9.99
usa election voter turnout            4.77                 15.83
curing cancer technology advancement  8.23                 7.15
harmful effects television children   14.73                4.15
Average Time                          5.61                 5.91

Table 9
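The averages in Table 9 can be recomputed from the per-query times, which also shows how mixed the per-query picture is. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# query -> (fast-updating sites time, slow-updating sites time), in seconds
table9 = {
    "computer": (2.84, 6.99),
    "longevity": (1.03, 0.87),
    "test": (2.80, 3.54),
    "current news": (10.03, 8.63),
    "philosophical arguments": (3.33, 8.42),
    "pigeon hole": (1.43, 1.05),
    "buy car cheap": (5.53, 1.63),
    "lowest market rate": (4.36, 2.61),
    "career intern positions": (8.29, 9.99),
    "usa election voter turnout": (4.77, 15.83),
    "curing cancer technology advancement": (8.23, 7.15),
    "harmful effects television children": (14.73, 4.15),
}

fast_avg = sum(fast for fast, _ in table9.values()) / len(table9)
slow_avg = sum(slow for _, slow in table9.values()) / len(table9)

# Queries that ran faster on the fast-updating sites than on the slow ones.
faster_on_fast = [q for q, (fast, slow) in table9.items() if fast < slow]

print(f"fast avg {fast_avg:.2f} s, slow avg {slow_avg:.2f} s")
print(f"{len(faster_on_fast)} of {len(table9)} queries were faster on fast-updating sites")
```

The two averages are close, while individual queries differ widely in both directions, so update rate alone does not determine retrieval time.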
Indexing Costs
Creating and maintaining a single index is an expensive process.
The temporal index must be rebuilt every day.
There is a significant cost in comparison to a small daily index that can be created and used without modification.
Index Build Times
Figure 14: chart of Time (min) against Number of Days (1 to 14) for the Temporal Index and Multi-Index systems.
Index Space Costs
The temporal index uses less storage than the multiple-index system.
The temporal index Dictionary does not grow as quickly since many words are shared across documents collected on subsequent days.
The Postings files, however, are identical in size.
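Why a single Dictionary grows more slowly can be seen by counting entries. The vocabularies below are made up for illustration; only the counting argument comes from the slide.

```python
# Hypothetical vocabulary of the pages collected on each day.
daily_vocab = {
    "20050322": {"temporal", "search", "engine", "index"},
    "20050323": {"search", "engine", "query"},
    "20050324": {"temporal", "index", "query", "range"},
}

# Multiple-index system: every day has its own Dictionary, so a word that
# appears on several days is stored once per day.
multi_index_entries = sum(len(words) for words in daily_vocab.values())

# Temporal index: one Dictionary, so each distinct word is stored once.
temporal_entries = len(set().union(*daily_vocab.values()))

print(multi_index_entries, temporal_entries)
```

Each word-in-document pair still needs its own Postings record in either design, which is consistent with the Postings sizes matching.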
Comparison of Dictionary Size
Figure 15: chart of Size (MB) against Number of Days (1 to 14) for the Temporal Index Dictionary, Multi-Index Individual Dictionary, and Multi-Index Total Dictionary.
Comparison of Postings Size
Figure 16: chart of Size (MB) against Number of Days (1 to 14) for the Temporal Index Postings, Multi-Index Individual Postings, and Multi-Index Total Postings.
Conclusions
The only accurate search over a multiple-index system is by starting at the beginning of the archive.
We have shown that temporal index retrieval times are faster than those of a multiple-index system.
  The decrease in time comes from only needing a single lookup in a Dictionary.
  The complexity of the query does affect retrieval.
  Searching from the end of the archive increases retrieval times, but the temporal index is still quicker.
  The update rate of a site has an impact on retrieval times, but is not the only dominant factor.
Conclusions
The tradeoff is the cost of building the temporal index every time new information is added.
This disadvantage is not visible to the user and costs only system time and resources.
The temporal index system also requires less space due to the single dictionary file.
Future Work on the Temporal Search Engine
Developing a method to incrementally build a temporal index would greatly improve the efficiency of indexing in the Temporal Search Engine.
The database backend could be extended to handle more information. With this more accurate information, improvements could be made to retrieval times.
Modify the use of diff with the spider to look for content changes instead of any change.
Future Work with the Temporal Search Engine
Look at using web servers to track version information instead of using a spider to map websites.
Examine the possibility of storing only the changes between documents instead of entire new documents, similar to RCS.
The Temporal Search Engine may be better suited to smaller sites that update less frequently.
  Thoroughly test the effect of update rate on retrieval and index times.
Thank you for your time
Questions?