Ryan Caplet
CSE 4904 – Fall 08
Milestone 4
Nov 5, 2008
Crawler-Based Search Engine
Introduction:
Purpose and Scope:
The purpose of this document is to explain the progress of the project on a crawler-based search engine. At this stage, the group is beginning to bring the different parts of the project together so they can work as one system. This document highlights the parts the group has worked on and how the group plans to make the project work as a whole, and it discusses how the system is tested and how the pieces the group has designed fit together. One section lays out a thorough strategy for integrating each part of the system, with criteria for when a part is ready to be tested and specifics on how it is to be tested; that testing in turn prepares the rest of the project for testing. There are also criteria for when the system is ready for beta testing by users.
Integration Strategy:
Entry Criteria:
The criterion for a part of the system to enter testing is this: the first party, the creator of that part, must be able to run it and know that it works. A second party must then run it and verify that it is correct. In the sequence explained later, the earlier parts of the project must be completed first so that the testing of the later parts is more effective.
Elements to be integrated:
There are several parts that make up the search engine. So far the group has developed the crawler, the indexer, the keyword generator, and the search function. The crawler analyzes each web page and builds an index of the downloaded sites so that the other parts of the project have quicker access to them. The keyword generator builds a table for each word so that when people search for a word they can find which sites contain it. The search function then searches those keyword tables and produces a list of URLs related to the word. The crawler, indexer, and keyword generator are separate from the rest of the project because they are only run once in a while to rebuild their respective lists. Within this project, every part depends on something else in order to run correctly. The crawler/indexer and the keyword generator are already integrated together; the search is the only part that is not directly integrated yet. The crawler, new_strip.pl, is written in Perl; the index is a MySQL table; and the keyword-building script is written in both PHP and Perl. keyword.php is the keyword-building script, and it also uses the processKeyword.pl Perl script.
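As a rough illustration of how these pieces connect, the index might live in a single MySQL table along the following lines. This is only a sketch; the table and column names are assumptions for illustration, not the project's actual schema.

```php
<?php
// Hypothetical sketch of the index table that new_strip.pl might build.
// Table and column names are assumptions, not the project's real schema.
$db = mysql_connect('localhost', 'user', 'password');
mysql_select_db('search_engine', $db);

mysql_query("CREATE TABLE IF NOT EXISTS page_index (
                 id    INT AUTO_INCREMENT PRIMARY KEY,
                 url   VARCHAR(255) NOT NULL UNIQUE,  -- no duplicate sites
                 title VARCHAR(255) NOT NULL,         -- no blank titles
                 path  VARCHAR(255) NOT NULL          -- local copy of the page
             )", $db);
?>
```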
Integration Strategy:
The group decided to organize the project into the separate parts discussed above because not every part of the project is going to be running at all times. The crawler, index, and keywords will change only if the database administrator decides to update the keywords and the index. The project's integration approach is a bottom-up one. The project was designed this way because there are clearly separable parts in the crawler-based search engine being built. The group built and tested each separate part as the project was made, and each group member was given a task within the project.
Sequence of Features / Functions and Software Integration:
The sequence of features is: the recursive wget, which retrieves the pages; the crawler, which builds the index; and the keyword generator, which builds the tables for each keyword. These keyword tables are then searched by the search page, and the results are displayed on that page. The subsystems are integrated in this order because each one depends on the one immediately before it.

These features do not have any specific hardware dependencies, but there are some software dependencies. The project requires PHP 4 or higher with the MySQL module, along with Perl, which comes standard with Linux. The server also needs to run Linux and have wget installed so that the needed pages can be downloaded.
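A quick sanity check of those software dependencies could be scripted. The sketch below assumes PHP 4.1 or later for version_compare; the wget and Perl checks are left to the server setup.

```php
<?php
// Rough dependency check for the server environment (a sketch, not part
// of the project's actual code).
if (version_compare(phpversion(), '4.0.0', '<')) {
    die("PHP 4 or higher is required.\n");
}
if (!extension_loaded('mysql')) {
    die("The PHP MySQL module is required.\n");
}
echo "PHP and the MySQL module look OK.\n";
?>
```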
Individual Steps:
The way the group knows that everything has worked up to this point is that each part was tested together with the next part. First, the recursive wget was used to download all the web pages on the UConn network. The crawler/indexer used this information to create the index of the web site; the crawler must build a clean index with no blank titles and no duplicates. The keyword generator script then uses the index for quick access to each of the files so it can read in keywords and create a new database table for each word. This part is very hard to test at large scale, but completing a run without any errors is sufficient. The search has basic functionality, and it will eventually search the word tables created by the keyword script. Right now the search uses a test database to verify its functionality; eventually the search page will use the keyword database tables to search for words across the UConn web site.
Software & Subsystem Integration Test Description:
The project has multiple parts and subsections that make up the entire system. The basic functionality of the subsystem functions, such as the quicksort and reverse functions discussed in more depth later, was tested using an array of numbers. Once Ryan knew that these two functions worked, he put them into the search script and modified them so that they could use two-dimensional arrays.

As far as separate software sections being integrated, each part of the project is, as mentioned throughout this document, dependent on the part that was made and tested before it. To test each part of the system, one needs to run the section before it to obtain the data for that part. The search engine itself relies on the keywords, which can be generated once and left in place for searching; there should be no need to rebuild all the pages every time someone searches.
Final Functional Tests:
The functional tests done at the end of this project will have the search web page interact with the keyword databases. The web page should print the results on either a new page or the same page. If the search does not find a word, it should handle the error gracefully and return no results. If multiple words are searched for, the search should find both words and score each result by the sum of the two words' frequencies. The group has already tested the new_strip.pl Perl script, which reads in the information from the web pages and builds the index database; this information is needed to produce the keywords. The keyword script, written in PHP and Perl, has also been tested, because the data needed by the search function relies on the keywords. It works, but it takes a long time to run.
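To make the multiple-word case concrete, the combined score for a page would be the sum of each word's frequency on that page. The sketch below assumes an open MySQL connection and one table per word with url and frequency columns; those names are illustrative, since the document does not spell out the schema.

```php
<?php
// Hypothetical sketch: score each page by adding up the per-word
// frequencies across all searched words.
$words  = array('engineering', 'school');   // example query terms
$scores = array();                          // url => summed frequency

foreach ($words as $word) {
    $result = mysql_query("SELECT url, frequency FROM word_" . $word);
    while ($row = mysql_fetch_assoc($result)) {
        if (!isset($scores[$row['url']])) {
            $scores[$row['url']] = 0;
        }
        $scores[$row['url']] += $row['frequency'];
    }
}
arsort($scores);   // highest combined frequency first
?>
```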
Exit Criteria:
The final criterion that must be defined for the testing process is when the system will be ready for beta testing. The search is the main part of the system the group must worry about, because it is what users will actually be using. A few bugs that still need to be worked out are supporting multiple-word searches and returning no results when a word table is not found. The group will also eventually add some security to prevent server-side commands and SQL injection as the testing goes along. These obvious bugs are what need to be fixed before the system can be used by the public.
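A sketch of how those two fixes might look follows. Since the searched word ends up in a table name, a whitelist filter is a safer guard against SQL injection than quote escaping alone; all names here are illustrative, not the project's actual code.

```php
<?php
// Hypothetical sketch of the two planned fixes: sanitize the user's input
// and return "no results" when no word table exists, instead of an error.
$word = strtolower(preg_replace('/[^A-Za-z0-9]/', '', $_GET['query']));

if ($word == '') {
    echo "No results found.";
} else {
    $check = mysql_query("SHOW TABLES LIKE 'word_" . $word . "'");
    if (mysql_num_rows($check) == 0) {
        echo "No results found.";
    } else {
        $result = mysql_query("SELECT url, frequency FROM word_" . $word);
        // ... display the matching URLs ...
    }
}
?>
```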
Program Stubs and Test Data Required:
The program stubs currently included in the project range from individual functions within a part of the project to an entire part of the project itself. The crawler is a major part of the beginning of this project. Its program stub is new_strip.pl, a Perl script. It takes the contents of the folder /home/<user>/www.engr.uconn.edu/ and indexes all the pages recursively into a MySQL database. This requires the web pages to be present in the directory that the script specifies.

The keyword generator is the next major part of the program. The script is named keyword.php, but it depends on a few other scripts, such as process2.php, which has tools used to strip the HTML tags so that the system can collect the keywords. This script and the processKeyword.pl Perl script are used to build the MySQL database tables for each word. Perl was used here because the PHP version was having memory leaks. This part needs the index to be built before it can run.
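A rough sketch of the keyword-building loop on the PHP side follows. The table layout and the page_index table are carried over from the earlier sketch and remain assumptions; the real keyword.php also hands work off to processKeyword.pl, which is not shown.

```php
<?php
// Hypothetical sketch of the keyword generator: walk the index, strip the
// HTML, count word frequencies, and build one table per word.
$pages = mysql_query("SELECT url, path FROM page_index");
while ($page = mysql_fetch_assoc($pages)) {
    $text  = strip_tags(file_get_contents($page['path']));
    $words = array_count_values(str_word_count(strtolower($text), 1));

    foreach ($words as $word => $frequency) {
        if (!preg_match('/^[a-z0-9]+$/', $word)) {
            continue;   // skip words that would make an invalid table name
        }
        mysql_query("CREATE TABLE IF NOT EXISTS word_" . $word .
                    " (url VARCHAR(255), frequency INT)");
        mysql_query("INSERT INTO word_" . $word . " VALUES ('" .
                    mysql_real_escape_string($page['url']) . "', " .
                    (int)$frequency . ")");
    }
}
?>
```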
The search lives in the results.php file, which is an HTML file with PHP code embedded in it. Right now this is the main page where one will be able to search the UConn web site. This part of the system uses the keywords from the previous part to make a list of URLs to each of the web sites that match the input criteria. This part of the program is tested by entering a word into the text box and pressing submit; if the word matches the criteria, the page should return the information found for those keywords.
Some sub-program stubs that are part of the search are the quicksortRecursive(array) and reverse(array) functions. Quicksort is the algorithm used to sort the results by keyword frequency, but it sorts them least to greatest; reverse is used to flip the contents so the data in the array runs greatest to least. Ryan chose the recursive quicksort algorithm so that even with a big list of web pages the sort would be as fast as possible. These two functions were tested in a separate environment: the quicksort function was tested with an array of numbers and then modified so that it could be used with a two-dimensional array, and the reverse function was also tested with an array. More stubs may be added to the system as the group works the bugs out of the search function and enables it to search using multiple words.
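The document does not include the source of these two functions, but a sketch of the pair as first tested on a plain array of numbers might look like the following; for the real search results, the comparison would look at the frequency column of each two-dimensional row instead.

```php
<?php
// Sketch of the two search helpers: a recursive quicksort that orders
// values least to greatest, and a reverse pass that flips the result.
function quicksortRecursive($array) {
    if (count($array) <= 1) {
        return $array;
    }
    $pivot   = array_shift($array);
    $less    = array();
    $greater = array();
    foreach ($array as $value) {
        if ($value <= $pivot) {
            $less[] = $value;
        } else {
            $greater[] = $value;
        }
    }
    return array_merge(quicksortRecursive($less), array($pivot),
                       quicksortRecursive($greater));
}

function reverse($array) {
    $flipped = array();
    for ($i = count($array) - 1; $i >= 0; $i--) {
        $flipped[] = $array[$i];
    }
    return $flipped;
}

print_r(reverse(quicksortRecursive(array(3, 1, 4, 1, 5))));   // 5 4 3 1 1
?>
```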
Responsibilities and Schedules:
Roles and Responsibilities:
Bryan has been doing most of the coding for the crawler and the keywords. Morris has been working on the search function and has been the main administrator for MySQL and the partition the group gets on the network. Ryan has been working to design a new, cleaner interface, has helped with small parts of the crawler, and is helping with the search. Morris and Ryan are, for the most part, the ones testing the search right now and adding functionality to it.
The project will be completed in the next week, and testing has been going on as the group progressed through the project. Some parts of the project the group knows work, such as the crawler and the keyword builder, but they are arduous to run because they can take a lot of time to complete. The search, however, once completed, will need extensive testing to make sure it displays the correct information and works with the database of keywords.
Key Dependencies:
One dependency that might hinder the testing process is that the group is not sure how to fit the two gigabytes of web page data onto the five-hundred-megabyte partition of space it gets for its web page. The keywords database also takes a very long time to build, because there are thousands of words within the web pages. Most of what has been tested is on Bryan's laptop, and some of it is on Ryan's laptop. Building the keyword tables takes a long time, and it would take a while on the server unless the tables were transported manually. To do that, the group would need root privileges on the server, which might be more than a system administrator is willing to give to anyone.
Risks and Assumptions:
There are some risks and assumptions in the testing schedule and process that will need attention. The main assumption/risk is whether there is enough space for all of the pages in case the database needs to be updated. The pages take up about two gigabytes, which is much bigger than the group's partition. The keywords database might also be a problem, because it too is very large, and even if it were copied over by itself, the group would need administrative access to that server. This would make the whole testing process very complicated.
Schedules:
As far as a testing schedule, the group loosely assigned jobs to each of the members. Over time, however, the jobs were swapped amongst the group members, and members helped one another on parts of the project and tested them along the way to make sure they were working properly. Ryan verified most of the work that Bryan did by testing it on his own machine. At times Bryan asked Ryan to work on small parts of the crawler/indexer, such as filling in blank title fields in the web site index. Currently Ryan and Morris are working on the search function, and testing goes on as the work proceeds. The search should be well tested going into the next week. Aside from any other formal testing, most of the testing is already complete, since each part needs to work in order for the next part to work: the crawler builds the index, the index is needed to build the keywords database, and the keywords are used by the search.
Problem Recording and Resolution:
When problems came up while developing each part of the search engine system, the group found ways to deal with them. Bryan ran into a problem with the PHP code, and the way he dealt with it was to work around the problem. The job of developing the search functionality changed hands a few times due to the unorganized nature of the group. Both of these were revisions and terminations that were dealt with within the group.
So far each part of the program has worked the way the group wanted it to. A major reworking in the development process involved the most recent piece, the keyword generator. Initially the script was written fully in PHP, but since the process was so extensive, the PHP code could not handle the data due to memory leaks. Bryan reworked that part by rewriting the code in Perl, which ended up not having memory leaks and working the way he wanted.
The plan mapping out which group members did what has changed slightly due to the unstructured nature of the group throughout the project. Originally Ryan was going to do the search function and Morris the user interface. The plan changed when Morris moved to the search engine and Ryan to the user interface. Now both are working on parts of the search function, since it is the next big piece of the project that needs to get done. The plans changed those few times because the group knew these parts needed to get done, and whoever was most available took on the job.