
Page 1:

By: Jamie McPeek

Page 2:

1. Background Information
   1. Metasearch
   2. Sets
   3. Surface Web/Deep Web
   4. The Problem
   5. Application Goals

Page 3:

Metasearch is a way of querying any number of other search engines and combining their results to generate more precise data.

Metasearch engines began appearing on the web in the mid 90s.

Many metasearch engines exist today but are generally overlooked due to Google’s grasp on the “searching” industry.

An example: www.info.com

Page 4:

A set is a collection of similar objects with no repeated values.

For our purposes, a set consists of any number of web pages, with the set named after the search engine that returned them.

When the results from all search engines are viewed together, the collection is no longer a set, as it may contain many repeated values.

Removing search engines to reduce or eliminate this redundancy is an NP-complete problem.

Page 5:

A cover is a collection of sets that together contain at least one of each element in the “universe”; in our case, at least one copy of every web page returned.

Keeping the “cover” in mind, we have a goal: remove as many search engines as possible while maintaining a cover. If possible, remove all redundant search engines.

This creates a minimal cover.
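A small hypothetical example of a cover versus a minimal cover (the pages and engines here are made up for illustration):

    Universe U = {p1, p2, p3, p4, p5}     (every unique page returned)
    Engine A  = {p1, p2, p3}
    Engine B  = {p3, p4}
    Engine C  = {p4, p5}
    Engine D  = {p2, p5}

{A, B, C, D} is a cover, but B and D are redundant: {A, C} already contains every page, so {A, C} is a minimal cover.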

Page 6:

The surface web is any content that is publicly available or accessible on the internet.

The deep web is all other content: data accessible only through an on-site query system, data generated on the fly, and frequently updated or changed content.

Most normal search engines are incapable of retrieving data from the deep web, or catch it in only one state.

Page 7:

Various search engines catching different states would allow a compilation of their results to provide a clearer picture of the actual content.

Specialized and site-only search systems can almost always be generalized to allow remote searching.

With the above in mind, metasearch becomes an intriguing idea as a way to view not only the surface web, but the deep web as well.

Page 8:

A finite and known number of search engines each return a set of a finite and known number of web pages.

Across all search engines, there may be redundancy. The idea is to remove as many unnecessary search engines from the meta set as possible while leaving a complete cover of the web pages.

Accuracy (relative to the true minimal cover) and speed are the most important aspects.

Page 9:

Using two different languages, we want to:

1. Compare the accuracy and speed of two different algorithms.
2. Compare different structures for the data based on the same algorithm.
3. Assess the impact of “regions” on overall time.

Regions are a way of grouping elements based on which search engines they are in. At most one element in a region is necessary; all others are fully redundant (see the sketch below).
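To make the region idea concrete, here is a minimal sketch (my own names, not from the slides) that groups pages by an engine-membership mask. A 64-bit mask is used only for brevity; the actual data sets use 1,000 search engines, which would need a wider bitmap.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Pages whose engine-membership masks are identical fall in the same
    // region; at most one representative per region is needed for the cover.
    std::map<std::uint64_t, std::vector<int>> buildRegions(
        const std::vector<std::uint64_t>& membership)   // one mask per page id
    {
        std::map<std::uint64_t, std::vector<int>> regions;
        for (int id = 0; id < static_cast<int>(membership.size()); ++id)
            regions[membership[id]].push_back(id);      // group by identical mask
        return regions;
    }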

Page 10:

2. Source Code
   1. Original Setup
   2. Key Structures – C
   3. Key Structures – C++
   4. Procedure
   5. Reasons For Changing

Page 11:

System: UW-Platteville’s IO System
Language: C

Minor work was done on this code, as it was already written.

Used as a baseline for improvements.

Managed using Subversion to allow rollbacks and check-ins.

Page 12:

Structure for storing the sets. Each web page is mapped to a specific bit.

Bitwise operators are used for accessing/editing a specific bit:

    list[index] &= ~(1 << (position % (sizeof(unsigned int) << 3)));
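For context, the companion set and test operations would look something like this (a sketch; only the clear-bit expression above appears on the slide, and the index arithmetic is my reconstruction):

    #include <vector>

    const unsigned BITS = sizeof(unsigned int) << 3;   // bits per array element

    void setBit(std::vector<unsigned int>& list, unsigned position) {
        list[position / BITS] |= 1u << (position % BITS);
    }

    void clearBit(std::vector<unsigned int>& list, unsigned position) {
        list[position / BITS] &= ~(1u << (position % BITS));   // form shown above
    }

    bool testBit(const std::vector<unsigned int>& list, unsigned position) {
        return (list[position / BITS] >> (position % BITS)) & 1u;
    }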

Page 13:

Structure for storing web pages, stored using a tree for faster insertion.

BITMAP in this instance stores the specific search engines that the document exists in.

nID allows reference back to the specific web page instead of dragging the string around the entire time.
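A rough sketch of what such a tree node might hold (only BITMAP and nID come from the slide; the other names and the BITMAP stand-in are assumptions):

    #include <vector>

    using BITMAP = std::vector<unsigned int>;   // stand-in for the real bitmap type

    struct PageNode {
        int       nID;              // numeric id used in place of the URL string
        BITMAP    engines;          // bits mark the search engines containing the page
        PageNode* left  = nullptr;  // balanced-tree children
        PageNode* right = nullptr;
    };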

Page 14:

The bitmap structure as changed for C++:

Added some variables to reduce “fixed” calculations.

Added variables to hold new data available when reading in the web pages.

Converted to a class – OOP.
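One possible shape for that class (a sketch; the member names and the cached word count are assumptions based on the description above):

    #include <vector>

    class Bitmap {
    public:
        explicit Bitmap(unsigned nBits)
            : bits_(nBits),
              words_((nBits + BITS - 1) / BITS),   // "fixed" calculation done once
              data_(words_, 0u) {}

        void set(unsigned i)        { data_[i / BITS] |=  1u << (i % BITS); }
        void clear(unsigned i)      { data_[i / BITS] &= ~(1u << (i % BITS)); }
        bool test(unsigned i) const { return (data_[i / BITS] >> (i % BITS)) & 1u; }

    private:
        static const unsigned BITS = sizeof(unsigned int) << 3;
        unsigned bits_;    // number of usable bits
        unsigned words_;   // cached number of array elements
        std::vector<unsigned int> data_;
    };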

Page 15:

A new structure implemented for C++: a two-dimensional grid of these nodes implemented as a linked list. Eliminates “empty” bits.

The structure is self-destructive in use.

Access is coordinate-based, as in a matrix (i, j).
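A sketch of one node in such a grid (field names are mine; only the (i, j) coordinate idea comes from the slide):

    // Only (engine, document) pairs that actually occur get a node,
    // which is how the "empty" bits are eliminated.
    struct MatrixNode {
        int         i;      // row: search engine index
        int         j;      // column: web page index
        MatrixNode* right;  // next node in the same row
        MatrixNode* down;   // next node in the same column
    };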

Page 16:

1. Read web pages in from file.
   1. Number of search engines.
   2. Number of documents per search engine.
2. Store each incoming web page as a node in a balanced tree.
   1. Total number of web pages.
   2. Total number of unique web pages.
3. Set up whichever structure is to be used based on the numbers learned from reading in and storing the web pages (a sketch of this setup step follows below).
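A minimal sketch of steps 1 and 2, assuming a simple text layout (the actual file format is not shown in the slides, so the layout here is hypothetical); std::set stands in for the balanced tree:

    #include <cstddef>
    #include <fstream>
    #include <set>
    #include <string>

    struct SetupCounts {
        std::size_t totalPages  = 0;   // every page read, duplicates included
        std::size_t uniquePages = 0;   // distinct pages across all engines
    };

    SetupCounts readPages(const std::string& path)
    {
        std::ifstream in(path);
        std::size_t numEngines = 0;
        in >> numEngines;                       // number of search engines

        std::set<std::string> unique;           // balanced tree of unique pages
        SetupCounts counts;
        for (std::size_t e = 0; e < numEngines; ++e) {
            std::size_t numDocs = 0;
            in >> numDocs;                      // documents for this engine
            for (std::size_t d = 0; d < numDocs; ++d) {
                std::string url;
                in >> url;
                unique.insert(url);             // duplicates collapse here
                ++counts.totalPages;
            }
        }
        counts.uniquePages = unique.size();
        return counts;
    }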

Page 17:

Populate the structures based on the data available in the tree. This can be the original tree or the region tree.

The bitmap structure is stored in two ways:

1. Search engine “major” – used in the original C code.
2. Document “major” – used in the new C++ code.

Run one of the two algorithms over the structure and document the results: cover size and amount of time taken.
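To illustrate the two orientations (a sketch with my own names; vector<bool> stands in for the bitmap structure):

    #include <cstddef>
    #include <vector>

    using BitRow = std::vector<bool>;

    // Engine-major:   rows[engine][document]   (original C code)
    // Document-major: rows[document][engine]   (new C++ code)
    // One layout is simply the transpose of the other; every row is
    // assumed to have innerSize entries.
    std::vector<BitRow> transpose(const std::vector<BitRow>& rows,
                                  std::size_t innerSize)
    {
        std::vector<BitRow> out(innerSize, BitRow(rows.size(), false));
        for (std::size_t r = 0; r < rows.size(); ++r)
            for (std::size_t c = 0; c < innerSize; ++c)
                out[c][r] = rows[r][c];
        return out;
    }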

Page 18:

Personal preference: I’m significantly more familiar with C++ than I am with C.

Additional compiler options for the language.

OOP.

Additional language features: inline functions, operator overloading.

More readable code.

Page 19:

3. Algorithms
   1. Greedy Algorithm
   2. Check and Remove (CAR) Algorithm

4. Results
   1. Data Sets
   2. Baseline Results
   3. Updated (C++) Results

5. Regions
   1. Impact (Pending)

Page 20:

Straightforward, brute force: add the largest set, then the next largest, and so on.

Easily translated to code.

Makes no provision for removing redundant sets after reaching a cover.
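A minimal sketch of that approach (names and details are mine; the slide does not say whether “largest” means raw set size or largest contribution of uncovered pages, so raw size is used here, exactly as described):

    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <vector>

    // Take engines in order of size, largest first, until every unique
    // page is covered. No pass is made to drop redundant engines.
    std::vector<std::size_t> greedyCover(
        const std::vector<std::set<int>>& engines,   // pages per engine
        std::size_t totalUniquePages)
    {
        std::vector<std::size_t> order(engines.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) {
                      return engines[a].size() > engines[b].size();
                  });

        std::set<int> covered;
        std::vector<std::size_t> chosen;
        for (std::size_t e : order) {
            if (covered.size() == totalUniquePages) break;   // cover reached
            covered.insert(engines[e].begin(), engines[e].end());
            chosen.push_back(e);
        }
        return chosen;
    }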

Page 21:

Less direct approach; adds based on “uncovered” elements.

Remove phase makes a single pass at removing any redundant sets.
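A sketch of the check-and-remove idea as described above (details beyond the slide’s description, such as the processing order, are assumptions): the check pass keeps an engine only if it contributes an uncovered page; the remove pass makes a single sweep, dropping any kept engine whose pages are all covered by the others that remain.

    #include <cstddef>
    #include <map>
    #include <set>
    #include <vector>

    std::vector<std::size_t> checkAndRemove(
        const std::vector<std::set<int>>& engines,   // pages per engine
        std::size_t totalUniquePages)
    {
        // Check phase: keep an engine only if it adds an uncovered page.
        std::set<int> covered;
        std::vector<std::size_t> kept;
        for (std::size_t e = 0; e < engines.size(); ++e) {
            std::size_t before = covered.size();
            covered.insert(engines[e].begin(), engines[e].end());
            if (covered.size() > before) kept.push_back(e);
            if (covered.size() == totalUniquePages) break;
        }

        // Remove phase: one pass; an engine is redundant if every page it
        // holds is covered at least twice among the engines still retained.
        std::map<int, int> count;   // page -> retained engines holding it
        for (std::size_t e : kept)
            for (int page : engines[e]) ++count[page];

        std::vector<std::size_t> result;
        for (std::size_t e : kept) {
            bool redundant = true;
            for (int page : engines[e])
                if (count[page] < 2) { redundant = false; break; }
            if (redundant)
                for (int page : engines[e]) --count[page];   // drop this engine
            else
                result.push_back(e);
        }
        return result;
    }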

Page 22:

The structures and algorithms were tested on moderately large to very large data sets.

The number of documents ranged from 100,000 to 1,000,000.

The number of search engines was constant at 1,000.

Distribution was uniform (all search engines contained the same number of documents).

Non-uniform sets were tested by Dr. Qi. It apparently worked, or he would have let me know.

Page 23:

Greedy – Min: 1,500 seconds; Max: 29,350 seconds (8h 9m 10s).

CAR – Min: 16.5 seconds; Max: 138.5 seconds.

[Chart: CPU Time (Seconds) vs. Documents for the Greedy and CAR algorithms.]

Page 24:

Greedy – Min: 4.5 seconds; Max: 19.25 seconds.

CAR – Min: 1.0 seconds; Max: 7.75 seconds.

Matrix (both algorithms) – Min: 0.20 seconds; Max: 0.40 seconds.

[Chart: CPU Time (Seconds) vs. Documents, 100,000 to 1,000,000, for CAR, Greedy, Matrix CAR, and Matrix Greedy.]

Page 25:

The idea is to find and remove redundant web pages in an intermediary step between reading data and performing the algorithm.

Redundant web pages are determined based on the search engines that contain them.

Currently the process of removing these web pages takes more time than it saves. This is not true for the baseline code, as the run-time of the algorithms there is significantly longer.

It has not been determined whether this can be improved.