
Ryan Caplet
CSE 4904 – Fall 08
Milestone 4
Nov 5, 2008

Crawler Based Search Engine

Introduction:

Purpose and Scope:

The purpose of this document is to explain the progress of the project on a crawler-based search engine. At this stage the group is beginning to bring the different parts of the project together so that they can work as one system. This document highlights the parts the group has worked on and how the group intends for the finished project to work. It also discusses how the system as a whole is tested and how the pieces the group has designed fit together. A large portion of the document describes the strategy for integrating each part of the system: the criteria that decide when a part is ready to be tested, the specifics of how it is to be tested and how that testing prepares the later parts for their own testing, and the criteria for when the system is ready for beta testing by users.

Integration Strategy:

Entry Criteria:

The entry criterion for testing a part of the system is as follows: first, the creator of that part must be able to run it and confirm that it works; then a second group member must run it and verify that the results are correct. In the sequence explained later, the earlier parts of the project must be completed first so that testing of the later parts is effective, since each later part consumes the output of the one before it.


Elements to be integrated:

There are several parts that make up the search engine. So far the group has developed the crawler, the indexer, the keyword generator, and the search function. The crawler/indexer is the part of the project that analyzes the downloaded web pages and builds an index of those sites so that the other parts of the project have quick access to them. The keyword generator is the part of the system that builds a table for each word, so that when someone searches for a word the system can find which sites contain it. Finally, the search function looks up the keyword tables and produces a list of URLs related to the searched word. The crawler, indexer, and keyword generator sit apart from the rest of the project because they are only run once in a while to rebuild their respective lists. Within the project, every part depends on another part in order to run correctly. The crawler/indexer and the keyword generator are already integrated together; the search is the only part that is not directly integrated yet. The crawler, new_strip.pl, is written in Perl, the index is a MySQL table, and the keyword-building script is written in both PHP and Perl: keyword.php is the keyword-building script, and it also uses the processKeyword.pl Perl script.
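
To make the data flow concrete, the sketch below shows how the search side might read one of the per-word keyword tables and list the matching sites. The database name, credentials, table name (word_engineering), and column names (url, frequency) are all assumptions for illustration; the document does not give the group's actual schema.

    <?php
    // Hypothetical lookup of a single per-word keyword table: list every URL
    // recorded for the word "engineering" with its frequency.
    // Connection details, table name, and column names are assumed.
    $link = mysql_connect('localhost', 'user', 'password');
    mysql_select_db('searchengine', $link);

    $result = mysql_query('SELECT url, frequency FROM word_engineering', $link);
    while ($row = mysql_fetch_assoc($result)) {
        echo $row['url'] . ' (' . $row['frequency'] . ")\n";
    }
    mysql_close($link);
    ?>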

Integration Strategy:

The group decided to organize the project into the separate parts discussed above because not every part of the project runs at all times. The crawler, the index, and the keywords change only when the database administrator decides to update them. The project follows a bottom-up integration approach, which was chosen because there are clearly separable parts in the crawler-based search engine being built. The group built and tested each separate part as the project progressed, with each group member assigned a task within the project.

Sequence of Features / Functions and Software Integration:

The sequence of the features is: the recursive wget, which retrieves the pages; the crawler, which builds the index; and the keyword generator, which builds a table for each keyword. The keyword tables are then searched by the search page, and the results are displayed on that page. The subsystems are integrated in this order because each one depends on the one immediately before it.
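
As a rough illustration of that order only, the sketch below runs each stage in sequence from one PHP script. The wget options, the target URL, and the idea of a single driver script are assumptions; in practice the group runs these stages by hand and only occasionally.

    <?php
    // Hypothetical driver that runs the pipeline stages in their required order.
    // Stage 1: recursive wget downloads the web pages (URL and options assumed).
    system('wget --recursive --no-parent http://www.engr.uconn.edu/');

    // Stage 2: the crawler/indexer (new_strip.pl) builds the site index.
    system('perl new_strip.pl');

    // Stage 3: the keyword generator (keyword.php) builds a table for each word.
    system('php keyword.php');

    // Stage 4: the search page (results.php) can now query the keyword tables.
    ?>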

These features do not have any specific hardware dependencies, but there are some software dependencies. The project requires PHP 4 or higher with the MySQL module, and Perl, which comes standard with Linux. The server also needs to run Linux and have wget installed so that the pages can be downloaded.
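
A quick way to confirm those software dependencies on the server could look like the sketch below; these checks are an illustration and are not part of the group's scripts.

    <?php
    // Check the software dependencies listed above (sketch only).
    if (version_compare(PHP_VERSION, '4.0.0', '<')) {
        die("PHP 4 or higher is required.\n");
    }
    if (!extension_loaded('mysql')) {
        die("The PHP MySQL module is not loaded.\n");
    }
    // wget and Perl should be on the PATH of the Linux server.
    if (trim(shell_exec('which wget')) == '') {
        die("wget is not installed.\n");
    }
    if (trim(shell_exec('which perl')) == '') {
        die("Perl is not installed.\n");
    }
    echo "All software dependencies appear to be present.\n";
    ?>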

Individual Steps:

The way the group knows that everything has worked so far is that each part was tested together with the part that follows it. First, the recursive wget was used to download all of the web pages on the UConn network. That data was then used by the crawler/indexer to create the index of the web site; the crawler must build a clean index with no blank titles and no duplicate entries. The keyword generator script uses the index for quick access to each of the files so it can read in keywords and create a new database table for each word. This stage is very hard to test at large scale, but completing a full run without any errors is considered sufficient. The search has basic functionality and will eventually search the word tables created by the keyword script; right now it uses a test database to exercise that functionality. Eventually the search page will use the keyword database tables to search for words across the UConn web site.
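
As one hedged example of what "a clean index" can be checked for, the sketch below counts blank titles and duplicate URLs in a hypothetical index table; the table name (site_index) and column names (url, title) are assumptions.

    <?php
    // Sanity checks on a hypothetical index table: no blank titles, no duplicates.
    $link = mysql_connect('localhost', 'user', 'password');
    mysql_select_db('searchengine', $link);

    $blank = mysql_query("SELECT COUNT(*) FROM site_index WHERE title = ''", $link);
    $row = mysql_fetch_row($blank);
    echo "Blank titles: " . $row[0] . "\n";

    $dups = mysql_query("SELECT url, COUNT(*) AS n FROM site_index " .
                        "GROUP BY url HAVING n > 1", $link);
    echo "Duplicate URLs: " . mysql_num_rows($dups) . "\n";

    mysql_close($link);
    ?>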

Software & Subsystem Integration Test Description:

The project has multiple parts and subsections that make up the entire system. The basic functionality of the subsystem functions, such as the quicksort and reverse functions discussed in more depth later, was tested using an array of numbers. Once Ryan knew that these two functions worked, he put them into the search script and modified them to handle two-dimensional arrays.
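
The numeric test could have looked something like the sketch below, which checks a standalone recursive quicksort against PHP's built-in sort(); the function shown here is an illustration, not the group's actual quicksortRecursive code.

    <?php
    // Illustrative recursive quicksort over a plain array of numbers.
    function quicksortNumbers($arr) {
        if (count($arr) <= 1) {
            return $arr;
        }
        $pivot = $arr[0];
        $less = array();
        $more = array();
        for ($i = 1; $i < count($arr); $i++) {
            if ($arr[$i] < $pivot) {
                $less[] = $arr[$i];
            } else {
                $more[] = $arr[$i];
            }
        }
        return array_merge(quicksortNumbers($less), array($pivot),
                           quicksortNumbers($more));
    }

    // Simple test: compare against PHP's built-in sort().
    $numbers = array(42, 7, 19, 3, 88, 7, 0);
    $expected = $numbers;
    sort($expected);
    $actual = quicksortNumbers($numbers);
    echo ($actual === $expected) ? "quicksort test passed\n" : "quicksort test FAILED\n";
    ?>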

As far as integrating separate software sections goes, each part of the project is, as mentioned throughout this document, dependent on the part that was built and tested before it. To test each part of the system, one needs to run the section before it to obtain the data for that part. The search engine itself relies on the keywords, which can be generated once and then left in place for searching; there should be no need to re-crawl all of the pages every time someone searches.

Final Functional Tests:

The functional tests that will be done at the end of the project have the search web page interact with the keyword databases. The page should print the results on either a new page or the same page. If the search does not find a word, it should handle the error gracefully and return no results. If multiple words are searched for, the search should find both words and rank the results by the sum of the two words' frequencies. The group has already tested new_strip.pl, the Perl script that reads in the information from the web pages and builds the index database; this information is needed to produce the keywords. The keyword script, written in PHP and Perl, has also been tested because the data needed by the search function relies on the keywords. It works, but it takes a long time to run.
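
A minimal sketch of the multiple-word case is shown below, assuming each word's results have already been loaded as URL => frequency arrays; the helper name and data layout are illustrative only.

    <?php
    // Hypothetical merge of two words' results: sum the frequencies for URLs
    // that appear under both words, then rank highest combined frequency first.
    function mergeWordResults($wordA, $wordB) {
        $combined = $wordA;
        foreach ($wordB as $url => $freq) {
            if (isset($combined[$url])) {
                $combined[$url] += $freq;   // URL has both words: add frequencies
            } else {
                $combined[$url] = $freq;    // URL has only the second word
            }
        }
        arsort($combined);                  // sort by combined frequency, descending
        return $combined;
    }

    // Example data (made up): URL => frequency for each searched word.
    $first  = array('http://www.engr.uconn.edu/' => 12,
                    'http://www.engr.uconn.edu/cse/' => 5);
    $second = array('http://www.engr.uconn.edu/cse/' => 9);
    print_r(mergeWordResults($first, $second));
    ?>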

Exit Criteria:

The final criterion that must be defined for the testing process is when the system will be ready for beta testing. The search is the main part of the system the group must worry about, because it is what users will actually use. A few items still need to be worked out: supporting multiple search words, and returning no results gracefully when a word's table does not exist. The group will also eventually add security to block server-side commands and SQL injection as testing continues. These obvious bugs need to be fixed before the system can be opened to the public.
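
Since each keyword has its own table in this design, one possible guard against SQL injection and missing word tables is sketched below; the validation rule, the word_<keyword> table-naming scheme, and the connection details are assumptions, not the group's actual conventions.

    <?php
    // Hypothetical guard for the search input: accept only a plain word, then
    // check that the word's table exists before querying it.
    $word = isset($_GET['q']) ? strtolower(trim($_GET['q'])) : '';

    if (!preg_match('/^[a-z0-9]+$/', $word)) {
        die("Please enter a single word (letters and digits only).\n");
    }

    $link = mysql_connect('localhost', 'user', 'password');
    mysql_select_db('searchengine', $link);

    $table = 'word_' . $word;   // table-per-word naming assumed for illustration
    $check = mysql_query("SHOW TABLES LIKE '" .
                         mysql_real_escape_string($table, $link) . "'", $link);

    if (mysql_num_rows($check) == 0) {
        echo "No results found for '" . htmlspecialchars($word) . "'.\n";
    } else {
        $result = mysql_query("SELECT url FROM $table ORDER BY frequency DESC", $link);
        while ($row = mysql_fetch_assoc($result)) {
            echo $row['url'] . "\n";
        }
    }
    mysql_close($link);
    ?>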

Program Stubs and Test Data Required:

The program stubs currently included in the project range from small functions inside a part of the project up to a part of the project itself. The crawler is a major part of the beginning of the project; its program stub is new_strip.pl, a Perl script. It takes the contents of the folder /home/<user>/www.engr.uconn.edu/ and indexes all of the pages recursively into a MySQL database. This requires the web pages to be present in the directory that the script specifies.
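
The report does not spell out the index table's layout, so the sketch below shows only one plausible shape for it, with a URL and a title per downloaded page; the CREATE TABLE statement is an assumption for illustration.

    <?php
    // Hypothetical layout for the index table that the crawler would fill:
    // one row per downloaded page, storing its URL and page title.
    $link = mysql_connect('localhost', 'user', 'password');
    mysql_select_db('searchengine', $link);

    mysql_query("CREATE TABLE IF NOT EXISTS site_index (
                     url   VARCHAR(255) NOT NULL,
                     title VARCHAR(255) NOT NULL,
                     PRIMARY KEY (url)
                 )", $link);

    mysql_close($link);
    ?>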

The keyword generator is the next major part of the program. The script is named keyword.php, but it depends on a few other scripts, such as process2.php, which has tools for stripping the HTML tags so that the system can collect the keywords. This script and the processKeyword.pl Perl script are used to build the MySQL database tables for each word. Perl was used here because the PHP version was having memory leaks. This part needs the index to be built before it can run.
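
To illustrate the kind of work this stage does, the sketch below strips the HTML from one page and counts word frequencies with PHP's built-in functions; it is a simplification and not the group's keyword.php or process2.php code.

    <?php
    // Simplified illustration of the keyword stage for one page: strip the HTML
    // tags, split the text into words, and count each word's frequency.
    $html  = file_get_contents('page.html');    // file name assumed
    $text  = strtolower(strip_tags($html));
    $words = str_word_count($text, 1);          // format 1 = return an array of words
    $freq  = array_count_values($words);        // word => number of occurrences

    arsort($freq);                              // most frequent words first
    foreach ($freq as $word => $count) {
        echo $word . "\t" . $count . "\n";
    }
    ?>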

The search is contained in the results.php file, an HTML file with PHP code embedded in it. Right now this is the main page from which one will be able to search the UConn web site. This part of the system uses the keywords from the previous part to build a list of URLs for each web site that matches the input criteria. This entire part of the program is tested by entering a word into the text box and pressing submit; if the word matches the criteria, the page should return the information found from those keywords.
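
A minimal sketch of that page structure is given below: an HTML form whose submission is handled by PHP embedded in the same file. The field name and placeholder output are assumptions; the real results.php builds its URL list from the keyword tables instead of echoing a stub.

    <html>
    <body>
    <!-- Hypothetical skeleton of a results.php-style page: an HTML form with
         PHP embedded in the same file to handle the submitted word. -->
    <form method="get" action="">
        <input type="text" name="q">
        <input type="submit" value="Search">
    </form>

    <?php
    if (isset($_GET['q']) && $_GET['q'] != '') {
        $word = htmlspecialchars($_GET['q']);
        echo "<h3>Results for '" . $word . "'</h3>";
        // In the real page the keyword tables would be queried here and the
        // matching URLs printed as a list.
        echo "<p>(keyword lookup would go here)</p>";
    }
    ?>
    </body>
    </html>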

Some sub-program stubs that are part of the search are the quicksortRecursive(array) and reverse(array) functions. Quicksort is the algorithm used to sort the results by keyword frequency, but it sorts them least to greatest; reverse is then used to flip the contents so the data in the array runs greatest to least. Ryan chose the recursive quicksort algorithm so that, if there is a big list of web pages, the sort will be as fast as possible. These two functions were tested in a separate environment: the quicksort function was tested with an array of numbers and then modified so that it could be used with a two-dimensional array, and the reverse function was also tested with an array. More stubs may be added to the system as the group works the remaining bugs out of the search function and allows it to search using multiple words.
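
Below is a hedged sketch of how the two-dimensional versions might look, sorting rows of the form (url, frequency) by frequency and then reversing the result so the highest frequency comes first; it follows the description above but is not the group's actual quicksortRecursive or reverse code.

    <?php
    // Illustrative recursive quicksort over rows of the form array(url, frequency),
    // sorting least-to-greatest by frequency (index 1), as described above.
    function quicksortRowsByFrequency($rows) {
        if (count($rows) <= 1) {
            return $rows;
        }
        $pivot = $rows[0];
        $less = array();
        $more = array();
        for ($i = 1; $i < count($rows); $i++) {
            if ($rows[$i][1] < $pivot[1]) {
                $less[] = $rows[$i];
            } else {
                $more[] = $rows[$i];
            }
        }
        return array_merge(quicksortRowsByFrequency($less), array($pivot),
                           quicksortRowsByFrequency($more));
    }

    // Flip the sorted rows so the highest frequency comes first.
    function reverseRows($rows) {
        return array_reverse($rows);
    }

    // Example: three (url, frequency) rows, made up for illustration.
    $results = array(
        array('http://www.engr.uconn.edu/cse/', 4),
        array('http://www.engr.uconn.edu/', 11),
        array('http://www.engr.uconn.edu/ece/', 7),
    );
    $results = reverseRows(quicksortRowsByFrequency($results));
    print_r($results);   // highest frequency first
    ?>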

Responsibilities and Schedules:

Roles and Responsibilities:


Bryan has done most of the coding for the crawler and the keywords. Morris has been working on the search function and has been the main administrator for MySQL and for the partition the group gets on the network. Ryan has been designing a new, cleaner interface, has helped with small parts of the crawler, and is helping with the search. Morris and Ryan are for the most part the ones testing the search right now and adding functionality to it.

The project will be completed in the next week, and testing has been going on as the group progresses through the project. Some parts of the project are known to work, such as the crawler and the keyword builder, but they are tedious to run because they can take a lot of time to complete. The search, however, will need extensive testing once it is completed, to make sure it displays the correct information and works with the database of keywords.

Key Dependencies:

One dependency that might hinder the testing process is that the group is not sure how to fit the two gigabytes of web page data onto the five-hundred-megabyte partition of space allotted for our web page. The keywords database also takes a very long time to build because there are thousands of words within the web pages. Most of what has been tested so far is on Bryan's laptop, and some of it is on Ryan's laptop. Building the keyword tables takes a long time and would take a while on the server unless the tables were transferred manually; to do that, the group would need root privileges on the server, which might be more than a system administrator is willing to give to anyone.

Risks and Assumptions:


There are some risks and assumptions in the testing schedule and process that need attention. The main assumption, and risk, is that we have enough space for all of the pages in case the database needs to be updated; the pages take up about two gigabytes, which is much larger than our partition. The keywords database might also be a problem because it too is very large, and even if it were copied over by itself, the group would need administrative access to the server. This would make the whole testing process very complicated.

Schedules:

As far as a testing schedule goes, the group loosely assigned jobs to each member. Over time, however, the jobs were swapped among the group members; some of us helped the others on parts of the project and tested them along the way to make sure they were working properly. Ryan verified most of the work Bryan did by testing it on his own machine. At times Bryan asked Ryan to work on small parts of the crawler/indexer, such as filling in blank title fields in the web site index. Currently Ryan and Morris are working on the search function, and testing is going on as we go. The search should be well tested going into the next week. Aside from any other formal testing, most of the testing is already complete, since each part needs to work in order for the next part to work: the crawler builds the index, the index is needed to build the keywords database, and the keywords are used by the search.

Problem Recording and Resolution:

When problems came up while developing each part of the search engine system, the group found ways to deal with them. Bryan ran into a problem with the PHP code, and the way he dealt with it was to work around the problem. The job of developing the search functionality also changed hands a few times due to the loosely organized nature of our group. Both of these were revisions and changes of plan that we dealt with within the group.

So far each part of the program has worked as expected, the way the group wanted it to. A major reworking during the development process involved the most recent piece, the keyword generator. Initially the script was written entirely in PHP, but because the process is so extensive, the PHP code could not handle the data due to memory leaks. Bryan reworked that part by rewriting the code in Perl, which did not have the memory leaks and worked the way he wanted.

The plan mapping out which group members would do what changed slightly over the course of the project due to the unstructured nature of our group. It started with Ryan assigned to the search function and Morris to the user interface. The plan then changed so that Morris worked on the search engine and Ryan on the user interface. Now both of us are working on parts of the search function, since it is the next big piece of the project that needs to get done. These plans were changed those few times because the group knew we needed to get these parts done, and whoever was most available took on the job.
