thesis final on project search
TRANSCRIPT
8/18/2019 Thesis Final on project search
http://slidepdf.com/reader/full/thesis-final-on-project-search 1/68
CHAPTER I
INTRODUCTION
1.1 E-BUSINESS
Electronic business, commonly referred to as "e-business" or an internet business, is defined as
the application of information and communication technologies (ICT) in support of all the
activities of business, such as buying and selling products, with all business-related
transactions automated online. Examples of e-business web sites are amazon.com, ebay.com,
flipkart.com, snapdeal.com and many more. Nowadays, the quantity of information available
on e-business sites is rising rapidly. One of the most important goals of these e-businesses is to
provide relevant information to the user at the click of a mouse.
In such a huge, fragmented and unstructured information collection, today's greatest problem
is to find relevant information. For this online information retrieval system we use a
machine learning technique, the genetic algorithm, to find relevant information.
1.2 INFORMATION RETRIEVAL
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers). As defined in this way, information retrieval used to be an activity that only a
few people engaged in: reference librarians, paralegals, and similar professional searchers.
Now the world has changed, and hundreds of millions of people engage in information
retrieval every day when they use a web search engine or search their email. Information
retrieval is fast becoming the dominant form of information access, overtaking traditional
database-style searching.
IR can also cover other kinds of data and information problems beyond that specified in the
core definition above. The term "unstructured data" refers to data which does not have clear,
semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the
canonical example of which is a relational database, of the sort companies usually use to
maintain product inventories and personnel records. The field of information retrieval also
covers supporting users in browsing or filtering document collections or further processing a
set of retrieved documents. Given a set of documents, clustering is the task of coming up with
a good grouping of the documents based on their contents. It is similar to arranging books on
a bookshelf according to their topic.
Information retrieval systems can also be distinguished by the scale at which they operate,
and it is useful to distinguish three prominent scales. In web search, the system has to provide
search over billions of documents stored on millions of computers. Distinctive issues are
needing to gather documents for indexing, being able to build systems that work efficiently at
this enormous scale, and handling particular aspects of the web, such as the exploitation of
hypertext and not being fooled by site providers manipulating page content in an attempt to
boost their search engine rankings, given the commercial importance of the web.
An information retrieval system is needed:
1. To process large document collections quickly. The amount of online data has
grown at least as quickly as the speed of computers, and we would now like to be
able to search collections that total in the order of billions to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical to
perform the query Romans NEAR countrymen with grep, where NEAR might be
defined as "within 5 words" or "within the same sentence".
3. To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
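The NEAR operator mentioned in item 2 can be illustrated with a small positional check. This is only a sketch under stated assumptions: whitespace tokenization, case-insensitive matching, and a hypothetical helper named `near`. It is not how grep or any particular search engine implements proximity search.

```python
def near(text, term_a, term_b, k=5):
    """Return True if term_a and term_b occur within k word positions of each other."""
    # Assumed tokenization: split on whitespace, strip surrounding punctuation, lower-case.
    tokens = [tok.strip(".,;:!?").lower() for tok in text.split()]
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    return any(abs(i - j) <= k for i in positions_a for j in positions_b)


line = "Friends, Romans, countrymen, lend me your ears"
print(near(line, "Romans", "countrymen"))   # the two words are adjacent
print(near(line, "Friends", "ears", k=3))   # the two words are six positions apart
```

A real system would answer such queries from a positional index rather than by rescanning the text for every query.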
1.3 WEB SEARCH ENGINE
A web search engine is a software system that is designed to search for information on the
World Wide Web. The search results are generally presented in a line of results often referred
to as search engine results pages (SERPs). The information may consist of web pages,
images, and other types of files. Some search engines also mine data available in
databases or open directories. Unlike web directories, which are maintained only by human
editors, search engines also maintain real-time information by running an algorithm on a web
crawler.
Fig 1.1 The various components of a web search engine
A search engine operates in the following order:
1. Web Crawling
2. Indexing
3. Searching
1.4 WEB CRAWLING
Web crawling is the process by which we gather pages from the Web, in order to index them
and support a search engine. The objective of crawling is to quickly and efficiently gather as
many useful web pages as possible, together with the link structure that interconnects them.
A web crawler is sometimes referred to as a spider.
1.4.1 FEATURES A CRAWLER MUST PROVIDE
We list the desiderata for web crawlers in two categories: features that web crawlers must
provide, followed by features they should provide.
Robustness: The Web contains servers that create spider traps, which are generators of web
pages that mislead crawlers into getting stuck fetching an infinite number of pages in a
particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps
are malicious; some are the inadvertent side-effect of faulty website development.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a
crawler can visit them. These politeness policies must be respected.
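One common form of explicit policy is a robots.txt file. The sketch below shows how a crawler might consult such a file using Python's standard `urllib.robotparser` module. The robots.txt content, the host `example.com`, and the user agent name `ThesisBot` are hypothetical values chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the server.
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) crawler may fetch each URL.
print(parser.can_fetch("ThesisBot", "http://example.com/index.html"))
print(parser.can_fetch("ThesisBot", "http://example.com/private/data.html"))

# Requested delay in seconds between successive requests, if the site declares one.
print(parser.crawl_delay("ThesisBot"))
```

Implicit policies, such as not overloading a server that declares no delay, still have to be enforced by the crawler's own scheduling.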
1.4.2 FEATURES A CRAWLER SHOULD PROVIDE
Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines
and bandwidth.
Performance and efficiency: The crawl system should make efficient use of various system
resources including processor, storage and network bandwidth.
Quality: Given that a significant fraction of all web pages are of poor utility for serving user
query needs, the crawler should be biased towards fetching "useful" pages first.
Freshness: In many applications, the crawler should operate in continuous mode: it should
obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can
thus ensure that the search engine's index contains a fairly current representation of each
indexed web page. For such continuous crawling, a crawler should be able to crawl a page
with a frequency that approximates the rate of change of that page.
Extensible: Crawlers should be designed to be extensible in many ways, to cope with new
data formats, new fetch protocols, and so on. This demands that the crawler architecture be
modular.
1.4.3 CRAWLING
The basic operation of any hypertext crawler (whether for the Web, an intranet or other
hypertext document collection) is as follows. The crawler begins with one or more URLs that
constitute a seed set. It picks a URL from this seed set, and then fetches the web page at that
URL. The fetched page is then parsed, to extract both the text and the links from the page
(each of which points to another URL). The extracted text is fed to a text indexer. The
extracted links (URLs) are then added to a URL frontier, which at all times consists of URLs
whose corresponding pages have yet to be fetched by the crawler. Initially, the URL frontier
contains the seed set; as pages are fetched, the corresponding URLs are deleted from the URL
frontier. The entire process may be viewed as traversing the web graph. In continuous
crawling, the URL of a fetched page is added back to the frontier for fetching again in the
future.
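The crawl loop described above can be sketched over a tiny in-memory "web". Everything here is an assumption made for illustration: the `WEB` dictionary stands in for real HTTP fetching and HTML parsing, the URLs are invented, and the `index` dictionary stands in for the text indexer. Continuous crawling and politeness are left out.

```python
from collections import deque

# Hypothetical web graph: URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("home page", ["a.example/x", "b.example/"]),
    "a.example/x": ("page x", ["a.example/"]),
    "b.example/": ("other site", []),
}


def crawl(seed_urls, fetch):
    frontier = deque(seed_urls)   # URLs whose pages have yet to be fetched
    seen = set(seed_urls)         # every URL ever placed on the frontier
    index = {}                    # stand-in for the text indexer
    while frontier:
        url = frontier.popleft()  # pick a URL and delete it from the frontier
        text, links = fetch(url)  # fetch and parse the page
        index[url] = text         # feed the extracted text to the indexer
        for link in links:        # add newly discovered links to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index


index = crawl(["a.example/"], lambda url: WEB[url])
print(list(index))
```

Using a queue for the frontier gives a breadth-first traversal of the web graph; a production frontier would also prioritize URLs and enforce per-host delays.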
Fig 1.2 Web as a graph
1.4.4 ARCHITECTURE OF WEB CRAWLER
[Figure: crawler components WWW, DNS, Fetch, Parse, Content seen?, URL filter, Dup URL elim and URL frontier, with data stores Doc FP's, robots templates and URL set]
Fig 1.3 Architecture of a basic web crawler
1.5 PREPROCESSING
1. Collect the documents to be indexed.
2. Tokenize the text. Given a character sequence and a defined document unit, tokenization is
the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away
certain characters, such as punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears
3. Do linguistic preprocessing of tokens.
I. Dropping common terms (stop words): Sometimes, some extremely common
words which would appear to be of little value in helping select documents
matching a user need are excluded from the vocabulary entirely. These words
are called stop words. Example:
a an and are as at be by for from
has he in is it its of on that the
to was were will with
II. Capitalization/case-folding: A common strategy is to do case-folding by
reducing all letters to lower case. Often this is a good idea: it will allow
instances of Automobile at the beginning of a sentence to match with a query
of automobile.
III. Stemming and lemmatization: Stemming usually refers to a crude heuristic
process that chops off the ends of words in the hope of reducing them to a
common base form correctly most of the time, and often includes the removal
of derivational affixes. Lemmatization usually refers to doing things properly
with the use of a vocabulary and morphological analysis of words, normally
aiming to remove inflectional endings only and to return the base or dictionary
form of a word, which is known as the lemma. The goal of both stemming and
lemmatization is to reduce inflectional forms and sometimes derivationally
related forms of a word to a common base form. For instance:
am, are, is → be
car, cars, car's, cars' → car
The result of this mapping of text will be something like:
the boy's cars are different colors → the boy car be differ color
The most common algorithm for stemming English, and one that has
repeatedly been shown to be empirically very effective, is Porter's algorithm
(Porter 1980).
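The preprocessing steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the thesis implementation: the tokenizer is a simple regular expression, and `crude_stem` is a toy suffix stripper invented for this example. It is not Porter's algorithm, so its output differs from the mapping shown above (for example, it leaves "different" unchanged).

```python
import re

# The stop word list given in the text above.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}


def tokenize(text):
    # Keep runs of letters and apostrophes; punctuation is thrown away.
    return re.findall(r"[A-Za-z']+", text)


def crude_stem(token):
    # Toy stemmer (assumption): strip a possessive or plural ending only.
    for suffix in ("'s", "s'", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = [tok.lower() for tok in tokenize(text)]        # case-folding
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]  # stop word removal
    return [crude_stem(tok) for tok in tokens]               # stemming


print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
print(preprocess("The boy's cars are different colors"))
```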
1.6 INDEXING IN VECTOR SPACE MODEL
1.6.1 VECTOR SPACE MODEL
The representation of a set of documents as vectors in a common vector space is known as
the vector space model and is fundamental to a host of information retrieval operations,
ranging from scoring documents on a query to document classification and document
clustering. In this model, a document is viewed as a vector in n-dimensional document space
(where n is the number of distinguishing terms used to describe the contents of the documents
in a collection) and each term represents one dimension in the document space. A query is
also treated in the same way and constructed from the terms and weights provided in the user
request. Document retrieval is based on the measurement of the similarity between the query
and the documents. This means that documents with a higher similarity to the query
are judged to be more relevant to it and should be retrieved by the IR system in a higher position in
the list of retrieved documents. In this method, the retrieved documents can be presented
to the user in order of their relevance to the query.
TERM FREQUENCY: Assign to each term in a document a weight for that term that
depends on the number of occurrences of the term in the document. We would like to
compute a score between a query term t and a document d, based on the weight of t in d. The
simplest approach is to assign the weight to be equal to the number of occurrences of term t
in document d. This weighting scheme is referred to as term frequency and is denoted $\mathrm{tf}_{t,d}$,
with the subscripts denoting the term and the document in order.
DOCUMENT FREQUENCY: The document frequency $\mathrm{df}_t$ is defined to be the number of
documents in the collection that contain a term t.
INVERSE DOCUMENT FREQUENCY: Denoting the total number of documents in a
collection by N, we define the inverse document frequency (idf) of a term t as follows:
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
TF-IDF WEIGHTING: We now combine the definitions of term frequency and inverse
document frequency to produce a composite weight for each term in each document. The
tf-idf weighting scheme assigns to term t a weight in document d given by
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad \text{(i)}$$
In other words, $\text{tf-idf}_{t,d}$ assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending high
discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus
offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
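The weighting in (i) can be computed in a few lines. The sketch below uses a hypothetical three-document collection of already-preprocessed tokens, and it assumes a base-10 logarithm, since the text does not fix the base.

```python
import math
from collections import Counter

# Hypothetical collection of preprocessed documents.
docs = {
    "d1": ["car", "insurance", "auto", "insurance"],
    "d2": ["car", "auto", "auto"],
    "d3": ["best", "car"],
}

N = len(docs)                                               # total number of documents
tf = {d: Counter(tokens) for d, tokens in docs.items()}     # tf_{t,d}
df = Counter(t for tokens in docs.values() for t in set(tokens))  # df_t
idf = {t: math.log10(N / df_t) for t, df_t in df.items()}   # idf_t (base 10 assumed)


def tf_idf(term, doc_id):
    """Weight of term in document doc_id, following equation (i)."""
    return tf[doc_id][term] * idf.get(term, 0.0)


for term in ("car", "auto", "insurance"):
    print(term, [round(tf_idf(term, d), 3) for d in docs])
```

In this toy collection "car" occurs in every document, so its idf and therefore its weight is zero everywhere, which matches case 3 of the list above.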
DOCUMENT VECTOR: At this point, we may view each document as a vector with one
component corresponding to each term in the dictionary, together with a weight for each
component that is given by (i). For dictionary terms that do not occur in a document, this
weight is zero. This vector form will prove to be crucial to scoring and ranking. As a first step, we
introduce the overlap score measure: the score of a document d is the sum, over all query
terms, of the number of times each of the query terms occurs in d. We can refine this idea so
that we add up not the number of occurrences of each query term t in d, but instead the tf-idf
weight of each term in d:
$$\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$$
DOT PRODUCT: We denote by $V(d)$ the vector derived from document d, with one
component in the vector for each dictionary term. Cosine similarity: the standard way of
quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine
similarity of their vector representations $V(d_1)$ and $V(d_2)$:
$$\mathrm{sim}(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|} \qquad \text{(ii)}$$
where the numerator represents the dot product (also known as the inner product) of the
vectors $V(d_1)$ and $V(d_2)$, while the denominator is the product of their Euclidean
lengths. The dot product $x \cdot y$ of two vectors is defined as $\sum_{i=1}^{M} x_i y_i$. Let $V(d)$ denote
the document vector for d, with M components $V_1(d), \ldots, V_M(d)$. The Euclidean length
of d is defined to be
$$\sqrt{\sum_{i=1}^{M} V_i^2(d)}$$
Fig 1.4 Cosine similarity illustrated
The effect of the denominator of Equation (ii) is thus to length-normalize the vectors
$V(d_1)$ and $V(d_2)$ to unit vectors $v(d_1) = V(d_1)/|V(d_1)|$ and
$v(d_2) = V(d_2)/|V(d_2)|$. We can then rewrite (ii) as
$$\mathrm{sim}(d_1, d_2) = v(d_1) \cdot v(d_2)$$
Similarly, the query is represented as the vector $v(q)$, and the similarity between query and
document vector is calculated as
$$\mathrm{Score}(q, d) = \frac{V(q) \cdot V(d)}{|V(q)|\,|V(d)|}$$
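Equation (ii) and the query score can be sketched directly from the definitions of the dot product and the Euclidean length. The vectors below are hypothetical weights over a three-term dictionary, and the helper names `dot`, `length`, `cosine` and `rank` are chosen for this illustration only.

```python
import math


def dot(x, y):
    # Dot product: sum over components of x_i * y_i.
    return sum(a * b for a, b in zip(x, y))


def length(x):
    # Euclidean length: square root of the sum of squared components.
    return math.sqrt(sum(a * a for a in x))


def cosine(x, y):
    # Cosine similarity; defined here as 0.0 when either vector is all zeros.
    denom = length(x) * length(y)
    return dot(x, y) / denom if denom else 0.0


def rank(query_vec, doc_vecs):
    # Order document ids by decreasing similarity to the query.
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical weight vectors over the same three dictionary terms.
doc_vecs = {"d1": [0.0, 2.0, 1.0], "d2": [1.0, 0.0, 0.0], "d3": [0.0, 1.0, 1.0]}
query = [0.0, 1.0, 1.0]
print(rank(query, doc_vecs))
```

Because both vectors are length-normalized, a long document does not score higher merely for containing more term occurrences.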
1.7 GENETIC ALGORITHM
A genetic algorithm is a search algorithm based on the mechanics of natural selection and
natural genetics. Genetic algorithms combine survival of the fittest among string structures with a
structured yet randomized information exchange to form a search algorithm with some of the
innovative flair of human search.
The genetic algorithm was developed by John Holland and his colleagues at the University of
Michigan.
1.7.1 A SIMPLE GENETIC ALGORITHM
The mechanics of a simple genetic algorithm are surprisingly simple, involving nothing more
complex than copying strings and swapping partial strings. The explanation of why this
simple process works is more subtle and powerful.
A simple genetic algorithm that yields good results in many practical problems is composed
of three operators:
1. Reproduction
2. Crossover
3. Mutation
REPRODUCTION is a process in which individual strings are copied according to their
objective function values, f (biologists call this function the fitness function). Intuitively, we
can think of the function f as some measure of profit, utility, or goodness that we want to
maximize. Copying strings according to their fitness values means that strings with a higher
value have a higher probability of contributing one or more offspring in the next generation.
This operator is an artificial version of natural selection, a Darwinian survival of the fittest
among string creatures.
CROSSOVER may proceed in two steps:
1. Members of the newly reproduced strings in the mating pool are mated at random.
2. Each pair of strings undergoes crossing over as follows:
a. An integer position k along the strings is selected uniformly at random between 1 and the
string length less one, [1, l-1].
b. Two new strings are created by swapping all characters between positions k+1
and l inclusively.
MUTATION plays a decidedly secondary role in the operation of genetic algorithms.
Mutation is needed because, even though reproduction and crossover effectively search and
recombine extant notions, occasionally they may become overzealous and lose some
potentially useful genetic material (1's or 0's at particular locations). In artificial genetic
systems, the mutation operator protects against such an irrecoverable loss. In the simple genetic
algorithm, mutation is the occasional (with small probability) random alteration of the value
of a string position. In binary coding this simply means changing a 1 to a 0 and vice versa.
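The three operators can be sketched on bit strings as follows. This is an illustration, not the thesis code: reproduction is implemented here as roulette-wheel selection, which is one common way to copy strings with probability proportional to fitness, and the function names and the mutation probability parameter `p_m` are assumptions.

```python
import random


def reproduce(population, fitness, rng):
    """Copy strings with probability proportional to fitness (roulette wheel, assumed).

    Requires at least one string with positive fitness.
    """
    weights = [fitness(s) for s in population]
    return rng.choices(population, weights=weights, k=len(population))


def crossover(a, b, rng):
    """Pick k uniformly in [1, l-1] and swap all characters from position k+1 to l."""
    k = rng.randint(1, len(a) - 1)
    return a[:k] + b[k:], b[:k] + a[k:]


def mutate(s, p_m, rng):
    """Flip each bit independently with small probability p_m."""
    return "".join(("1" if c == "0" else "0") if rng.random() < p_m else c for c in s)


rng = random.Random(0)  # seeded so the demonstration is repeatable
pool = reproduce(["11000", "00011", "10101"], lambda s: s.count("1"), rng)
child_a, child_b = crossover("11111", "00000", rng)
print(pool, child_a, child_b, mutate("10101", 0.2, rng))
```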
1.7.2 GENETIC ALGORITHM STEPS
[Flowchart with boxes: Generate initial population; Evaluate each individual; Stopping criteria met? (YES leads to STOP, NO continues); Reproduction; Crossover; Mutation]
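The steps of the genetic algorithm can be put together in one loop. The sketch below is a minimal illustration under stated assumptions: the objective (counting 1 bits), the parameter values, the fitness-proportional selection, and the stopping criteria (a target fitness or a generation limit) are all chosen for this example and are not taken from the thesis.

```python
import random


def run_ga(fitness, length=12, pop_size=20, p_c=0.7, p_m=0.01,
           max_gens=100, target=None, seed=1):
    rng = random.Random(seed)
    # Generate the initial population of random bit strings (pop_size assumed even).
    pop = ["".join(rng.choice("01") for _ in range(length)) for _ in range(pop_size)]
    best = max(pop, key=fitness)                       # evaluate each individual
    for _ in range(max_gens):
        if target is not None and fitness(best) >= target:
            break                                      # stopping criterion met
        # Reproduction: fitness-proportional copying (tiny offset avoids all-zero weights).
        weights = [fitness(s) + 1e-9 for s in pop]
        pool = rng.choices(pop, weights=weights, k=pop_size)
        # Crossover: pair the mating pool and swap tails with probability p_c.
        nxt = []
        for a, b in zip(pool[0::2], pool[1::2]):
            if rng.random() < p_c:
                k = rng.randint(1, length - 1)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            nxt += [a, b]
        # Mutation: flip each bit with small probability p_m.
        pop = ["".join("10"[int(c)] if rng.random() < p_m else c for c in s) for s in nxt]
        best = max(pop + [best], key=fitness)          # keep the best string seen so far
    return best


def count_ones(s):
    # Hypothetical objective function used only for this demonstration.
    return s.count("1")


best = run_ga(count_ones, target=12)
print(best, count_ones(best))
```

For query optimization, the bit string and `count_ones` would be replaced by a query encoding and a retrieval-based fitness function.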
1.8 PROBLEM STATEMENT
TO OPTIMIZE THE SEARCH RESULT: When the user enters a query to search for
particular information, the information retrieval system or the search engine retrieves
search results that are both relevant and irrelevant to the user.
1.9 AIM AND OBJECTIVE
Here our basic aim is to retrieve relevant results and to reduce the number of irrelevant results
retrieved. The results retrieved must be in decreasing order of relevance.
OBJECTIVE
• Implement the vector space model.
• Optimize the query using the genetic algorithm.
• Retrieve the results using the optimized query.
1.10 ORGANIZATION OF THESIS
The rest of this thesis report is organized as follows:
We present the literature review in Chapter 2.
Research analysis, i.e. theoretical, computational, and analytical, is presented in
Chapter 3.
The results and discussion of the thesis are presented in Chapter 4.
We present the conclusion of the whole thesis in Chapter 5.
Finally, we present the future scope of work in Chapter 6.
CHAPTER II
LITERATURE REVIEW
E-business may be defined as the conduct of industry, trade and commerce using
computer networks. The term "e-business" was coined by IBM's marketing and Internet
teams in 1996 [1]. Electronic business methods enable companies to link their internal and
external data processing systems more efficiently and flexibly, to work more closely with
suppliers and partners, and to better satisfy the needs and expectations of their customers. The
internet is a public throughway. Firms use more private and hence more secure networks for
more effective and efficient management of their internal functions. In practice, e-business is
more than just e-commerce. While e-business refers to a more strategic focus with an emphasis
on the functions that occur using electronic capabilities, e-commerce is a subset of an overall
e-business strategy. E-commerce seeks to add revenue streams using the World Wide Web or
the Internet to build and enhance relationships with clients and partners and to improve
efficiency using the Empty Vessel strategy. Often, e-commerce involves the application of
knowledge management systems.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze [2], Introduction to
Information Retrieval, Cambridge University Press, 2008: The first eight chapters of the book
are devoted to the basics of information retrieval, and in particular the heart of search
engines; we consider this material to be the core of information retrieval. Chapter 1 introduces
inverted indexes, and shows how simple Boolean queries can be processed using such
indexes. Chapter 2 builds on this introduction by detailing the manner in which documents
are preprocessed before indexing and by discussing how inverted indexes are augmented in
various ways for functionality and speed. Chapter 3 discusses search structures for
dictionaries and how to process queries that have spelling errors and other imprecise matches
to the vocabulary in the document collection being searched. Chapter 4 describes a number of
algorithms for constructing the inverted index from a text collection, with particular attention
to highly scalable and distributed algorithms that can be applied to very large collections. A
desire to measure the extent to which a document matches a query, or the score of a document
for a query, motivates the development of term weighting and the computation of
scores in Chapters 6 and 7, leading to the idea of a list of documents that are rank-ordered for a
query. Chapter 8 focuses on the evaluation of an information retrieval system based on the
relevance of the documents it retrieves, allowing us to compare the relative performances of
different systems on benchmark document collections and queries. Chapter 9 discusses
methods by which retrieval can be enhanced through the use of techniques like relevance
feedback and query expansion, which aim at increasing the likelihood of retrieving relevant
documents. Chapter 15 introduces support vector machines, which many researchers currently
view as the most effective text classification method. We also develop connections in this
chapter between the problem of classification and seemingly disparate topics such as the
induction of scoring functions from a set of training examples. Chapter 19 gives a summary of the
basic challenges in web search, together with a set of techniques that are pervasive in web
information retrieval. Next, Chapter 20 describes the architecture and requirements of a basic
web crawler. Finally, Chapter 21 considers the power of link analysis in web search, using in
the process several methods from linear algebra and advanced probability theory.
Genetic Algorithms in Search, Optimization & Machine Learning [3], David E. Goldberg,
with foreword by John Holland: This text introduces the theory, operation, and application of
genetic algorithms, which are search algorithms based on the mechanics of natural selection and
genetics.
H. Chen [4]: Information retrieval using probabilistic techniques has attracted significant
attention on the part of researchers in information and computer science over the past few
decades. In the 1980s knowledge-based techniques also made an impressive contribution to
"intelligent" information retrieval and indexing. More recently, information science
researchers have turned to other, newer artificial-intelligence based inductive learning
techniques including neural networks, symbolic learning, and genetic algorithms. These
newer techniques, which are grounded on diverse paradigms, have provided great
opportunities for researchers to enhance the information processing and retrieval capabilities
of current information storage and retrieval systems.
This article provides an overview of these newer techniques and their use in information
science research. The three popular methods are the connectionist Hopfield network, the
symbolic ID3/ID5R, and evolution-based genetic algorithms. The article discusses their knowledge
representations and algorithms in the context of information retrieval. These techniques are
robust in their ability to analyze user queries, identify users' information needs, and suggest
alternatives for search. With proper user-system interactions, these methods can greatly
complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based
techniques.
Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali, and Osman A. Sadek
[5]: This study investigates the use of genetic algorithms in information retrieval. The method
is shown to be applicable to three well-known document collections, where more relevant
documents are presented to users in the genetic modification. The paper presents a new
fitness function for approximate information retrieval which is faster and more flexible
than the cosine similarity fitness function.
Eman Al Mashagba, Feras Al Mashagba and Mohammad Othman Nassar [6]: In
information retrieval research, Genetic Algorithms (GA) can be used to find global solutions
in many difficult problems. The study used different similarity measures (Dice, Inner
Product) in the VSM; for each similarity measure, ten different GA approaches were compared,
based on different fitness functions, different mutations and different crossover strategies, to
find the best strategy and fitness function that can be used when the data collection is in the
Arabic language. The results show that the GA approach which uses a one-point
crossover operator, point mutation and Inner Product similarity as a fitness function is the
best IR system in VSM.
Cristina López-Pujalte, Vicente P. Guerrero-Bote, Félix de Moya-Anegón [7]: Recently
there have been new applications of genetic algorithms to information retrieval,
most of them specifically to relevance feedback. The evolution of the possible solutions is
guided by fitness functions that are designed as measures of the goodness of the solutions.
These functions are naturally the key to achieving a reasonable improvement, and which
function is chosen most distinguishes one experiment from another. In previous work, the
authors found that, among the functions implemented in the literature, the ones that yield the best
results are those that take into account not only when documents are retrieved, but also the
order in which they are retrieved. Here, they therefore evaluate the efficacy of a genetic algorithm
with various order-based fitness functions for relevance feedback (some of them of their own
design), and compare the results with the Ide dec-hi method, one of the best traditional
methods.
A. S. Siva Sathya, B. Philomina Simon [8]: As information has been increasing enormously
in the world, it is difficult to retrieve the proper information to the user's satisfaction. In
this work, a document crawler is used for gathering and extracting information from the
documents available from online databases and other databases. Since the search space is too
large, a Genetic Algorithm (GA) is used to find the combination terms. In the proposed
document retrieval system, keywords are extracted by the document crawler, and with
these keywords the GA generates combination terms. The proposed work has three main
features. First, keywords and other information are extracted from the database by a document
crawler. Second, the combination terms are generated using the genetic algorithm. Third, the results
generated by the GA are applied to the information retrieval system to generate better results.
From the results obtained, the relevance of the documents is verified using the evaluation
measures precision and recall.
Detelin Luchev, Abdelmgeid A. Aly [9]: This paper presents an adaptive method using a
genetic algorithm to modify users' queries, based on relevance judgments. This algorithm
was adapted for the three well-known document collections (CISI, NLP and CACM). The
method is shown to be applicable to large text collections, where more relevant documents
are presented to users in the genetic modification. The algorithm shows the effects of
applying GA to improve the effectiveness of queries in IR systems. Further studies are
planned to adjust the system parameters to improve its effectiveness. The goal is to retrieve
the most relevant documents with the fewest non-relevant documents with respect to the user's
query in an information retrieval system using a genetic algorithm.
Philomina Simon [10]: Retrieval of relevant documents from a large document collection is
a challenging task. Document retrieval is concerned with indexing and retrieving documents
provided in a document collection. Documents are represented by document descriptors,
which are defined as terms or keywords extracted from the textual documents. Formulating
an optimal query with a set of document descriptors involves searching a huge search space
for the best permutation and combination of terms. As the Genetic Algorithm is well suited for
searching huge search spaces, this paper proposes a two-stage method for an efficient
information retrieval system using a genetic algorithm. The Genetic Algorithm generates the best
combination terms from a set of document descriptors.
Abdelmgeid A. Aly [11]: Similar to the genetic algorithm, evolution strategy is a process of
continuous reproduction, trial and selection. Each new generation is an improvement on the
one that went before. This paper presents two different proposals based on the vector space
model (VSM) as a traditional model in information retrieval (TIR). The first uses evolution
strategy (ES). The second uses the document centroid (DC) in a query expansion technique.
The results are then compared; it was noticed that the ES technique is more efficient than the
other methods.
Andrew Trotman [12]: Current approaches to information retrieval rely on the creativity of
individuals to develop new algorithms. This investigation examines the use of genetic algorithms (GA)
and genetic programming (GP) to learn IR algorithms. Document structure
weighting is a technique whereby different parts of a document (title, abstract, etc.) contribute
unevenly to the overall document weight during ranking. Near optimal weights can be
learned with a GA. Doing so shows a statistically significant 5% relative improvement in
MAP for vector space inner product and Croft's probabilistic ranking, but no improvement
for BM25. Two applications of this approach are suggested: offline learning, and relevance
feedback. In a second set of experiments, a new ranking function was learned using GP. This
new function yields a statistically significant 11% relative improvement on unseen queries
tested on the training documents. Portability tests to different collections (not used in
training) demonstrate that the performance of the new function exceeds vector space and
probability, and slightly exceeds BM25. Learning weights for this new function is proposed.
The application of genetic learning to stemming and thesaurus construction is discussed.
Stemming rules such as those of the Porter algorithm are candidates for GP learning, whereas
synonym sets are candidates for GA learning.
D. Vrajitoru [13]: Genetic algorithms (GAs) search for good solutions to a problem by
operations inspired by the natural selection of living beings. Among their many uses, we
can count information retrieval (IR). In this field, the aim of the GA is to help an IR system to
find, in a huge document text collection, a good reply to a query expressed by the user. The
analysis of phenomena seen during the implementation of a GA for IR has led to a
new crossover operation. This article introduces this new operation and compares it with
other learning methods.
The goal of this article is to introduce a new crossover operator for the GA used in IR. The
analysis presented in the third section shows the origin of the new operator, and the results,
compared to the classical GA, indicate that the crossover operator can be improved. A
comparison between this application of the GA and the relevance feedback method
shows that, even if the GA is less efficient than more direct methods, it still has its advantages
and will probably continue to be studied in the future.
I<=!/ R. S!<7@ '/ N+ S/:@ ! S. S/+@ 2FJ The vector space model is a
mathematical-based model that represents terms, documents and ueries by vectors and
provides a raning! In this model, the subspace of interest is formed by a set of pair %ise
orthogonal term vectors, indicating that terms are mutually independent! @o%ever, this is a
simplification that doesnQt correspond to the reality! Based on this scenery, in this %or, an
e$tension to the vector space model to tae into account the correlation bet%een terms! In the
proposed model, term vectors are rotated in space geometrically reflecting the dependence
semantics among terms! 6e rotate terms based on a data mining techniue called association
rules! The retrieval effectiveness of the proposed model is evaluated and the results sho%s
that our model improves in average precision, relative to the standard vector space model, for
all collections evaluated, leading to a gain up to 52S!
Hai-yan Kang et al. [15]: This paper puts forward an information retrieval algorithm that
combines the strengths of the Genetic Algorithm and the Vector Space Model on the basis of
natural language. The Genetic Algorithm is used for a predication case-frame of the query in
this system. Based on HowNet, this algorithm captures the inherent characteristics of data
objects and retrieves the useful information according to those characteristics. The system
thus implements information retrieval from hierarchical knowledge of concepts. This paper also
introduces the key technology of information retrieval. According to the research on the
algorithm model, we design an IR system in the financial domain. It is particularly effective
in question-answering systems. In addition, it can be extended to other domains. The
algorithm considerably enhances the degree of intelligence of information retrieval.
Shaozi Li et al. [16]: The mass of distributed and dynamic information on the Web has
resulted in "information overload". With this flood of information, searching the Web based
on traditional information retrieval technology has become an important research issue.
However, the variety of systems and the ambiguous terminology of information retrieval on
the Web cause much trouble to users in application and to researchers in development. This
paper proposes a uniform interface for Web document retrieval, based on a multi-agent model.
Each document in the document base or from the Web is represented as a vector in the vector
space of classable sememes. The user's query is also represented as a vector. The relevance
between them is measured using the cosine of the angle between the query and its nearest
neighbors in the vector space. Experiments have been carried out, and their results show that
this scheme yields good results.
Amir Karshenas and Kamil Dimililer [17]: This work designed an information retrieval system,
"PIRS" (Precision Information Retrieval System), based on a modified vector space model that
introduces a new query weighting formula and similarity function. These modifications of
the classical vector space model aimed to improve the average precision level of the system.
Two well-known IR parameters, precision and recall, were used to measure the performance
of the system.
Louis S. Wang [18]: The Vector Space Model is one of the most common information
retrieval (IR) methods for text document search. The cosine of the angle or the Euclidean
distance between the query vector and each document vector is commonly used to measure
similarity for query matching. Even though the vector space model starts with a
term-by-document matrix, it inevitably loses the information about relations between query
terms in the document in the first place. This paper presents a modified vector space model
for measuring similarity between the query and the document when responding to a multi-term
query. More weight is assigned to the keywords based on the adjacency between the terms in the
documents. Thus, when a document contains adjacent terms, its vector will typically
move closer to the query vector to show stronger relevancy between the query and the document.
CHAPTER III
RESEARCH METHODOLOGY
3.1 QUERY
Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard
IR task. In it, a system aims to provide documents from within the collection that are relevant
to an arbitrary user information need, communicated to the system by means of a one-off,
user-initiated query. An information need is the topic about which the user desires to know
more, and is differentiated from a query, which is what the user conveys to the computer in an
attempt to communicate the information need. A document is relevant if it is one that the user
perceives as containing information of value with respect to their personal information need.
A user is interested in a topic like "pipeline leaks" and would like to find relevant documents
regardless of whether they precisely use those words or express the concept with other words
such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its
search results), a user will usually want to know two key statistics about the system's returned
results for a query:
Precision: What fraction of the returned results are relevant to the information need?
Recall: What fraction of the relevant documents in the collection were returned by the system?
3.2 QUERY OPTIMIZATION
Query optimization means refining the query so that it retrieves more relevant results
and fewer irrelevant documents.
3.3 QUERY OPTIMIZATION SEARCH SYSTEM
3.3.1 THE VECTOR SPACE MODEL
• Documents and queries are both represented as vectors:
$d_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,t})$
• Each $w_{i,j}$ is a weight for term j in document i.
• "Bag-of-words" representation
• Similarity of a document vector to a query vector = cosine of the angle θ between them
Fig. 3.1 Angle between Document and Query
Cosine Similarity Measure
$\mathrm{Sim}(d_i, q) = \cos\theta$
Since $x \cdot y = |x|\,|y|\cos\theta$,
$\cos\theta = \frac{d_i \cdot q}{|d_i|\,|q|} = \frac{\sum_j w_{i,j}\, w_{q,j}}{\sqrt{\sum_j w_{i,j}^2}\,\sqrt{\sum_j w_{q,j}^2}}$
• Cosine is a normalized dot product
• Documents are ranked by decreasing cosine value
o Sim(d, q) = 1 when d = q
o Sim(d, q) = 0 when d and q share no terms
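The cosine measure above can be sketched in Java as follows. This is only a minimal illustration, not the thesis implementation: the class name CosineSimilarity and the method cosine are hypothetical, vectors are assumed to be dense arrays of equal length, and a zero vector is given similarity 0 by convention.

```java
/** Illustrative sketch only: the names here are hypothetical, not taken from the thesis code. */
final class CosineSimilarity {
    private CosineSimilarity() {}

    /** Returns cos(theta) between two equal-length weight vectors, or 0 if either is all zeros. */
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normD = 0.0, normQ = 0.0;
        for (int j = 0; j < d.length; j++) {
            dot += d[j] * q[j];    // numerator: sum of w_ij * w_qj
            normD += d[j] * d[j];  // squared length of the document vector
            normQ += q[j] * q[j];  // squared length of the query vector
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0;            // assumed convention for empty vectors
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }
}
```

Ranking then amounts to computing this value for every document against the query and sorting in decreasing order.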
3.3.2 BUILDING IR SYSTEM
The proposed system is based on the Vector Space Model (VSM), in which both documents and
queries are represented as vectors. First, to determine the document terms, we used the
following procedure:
• Extraction of all the words from each document.
• Elimination of the stop-words using a stop-word list.
• Stemming of the remaining words using the Porter stemmer, which is the most commonly
used stemmer for English.
After applying this procedure, we obtained the final set of terms describing all documents of the
collection, and we assigned the weights using the following formula, proposed by Salton
and Buckley:
$a_{ij} = \frac{\left(0.5 + 0.5\,\frac{tf_{ij}}{\max tf}\right) \times \log\frac{N}{n_j}}{\sqrt{\sum_{k}\left(0.5 + 0.5\,\frac{tf_{ik}}{\max tf}\right)^2 \times \left(\log\frac{N}{n_k}\right)^2}}$ ...... (3.1)
where $a_{ij}$ is the weight assigned to the term $t_j$ in document $D_i$, $tf_{ij}$ is the number
of times that term $t_j$ appears in document $D_i$, $\max tf$ is the largest term frequency in
$D_i$, $n_j$ is the number of documents indexed by the term $t_j$, the sum in the denominator runs
over the terms $t_k$ of $D_i$, and N is the total number of documents in the database.
Finally, we normalize the vectors, dividing them by their Euclidean norm. This follows the
study of Noreault et al., in which the best similarity measures are those that compare the angles
between vectors.
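The weighting and normalization step can be sketched in Java as below. This is a hedged illustration rather than the thesis code: the class and method names are hypothetical, the denominator is assumed to sum over all terms of the document (the usual Salton and Buckley normalization), terms absent from a document are assumed to get weight 0, and the natural logarithm is used because the base cancels after normalization.

```java
/** Illustrative sketch only: names and conventions are assumptions, not from the thesis code. */
final class SaltonBuckleyWeights {
    private SaltonBuckleyWeights() {}

    /**
     * @param tf raw term frequencies of one document (tf[j] = occurrences of term j)
     * @param df document frequencies (df[j] = number of documents containing term j)
     * @param n  total number of documents in the collection
     * @return unit-length weight vector; terms absent from the document get weight 0
     */
    static double[] weights(int[] tf, int[] df, int n) {
        int maxTf = 0;
        for (int f : tf) maxTf = Math.max(maxTf, f);
        double[] w = new double[tf.length];
        if (maxTf == 0) return w;             // empty document: all weights stay 0
        double sumSquares = 0.0;
        for (int j = 0; j < tf.length; j++) {
            if (tf[j] > 0) {
                double augmentedTf = 0.5 + 0.5 * tf[j] / maxTf;
                double idf = Math.log((double) n / df[j]);
                w[j] = augmentedTf * idf;     // numerator of the weight formula
                sumSquares += w[j] * w[j];
            }
        }
        if (sumSquares == 0.0) return w;      // every term occurs in all documents
        double norm = Math.sqrt(sumSquares);  // Euclidean norm of the document vector
        for (int j = 0; j < w.length; j++) w[j] /= norm;
        return w;
    }
}
```

Because every vector comes out with unit length, the cosine between a document and a query reduces to a plain dot product.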
We carry out a similar procedure with the collection of queries, thereby obtaining the
normalized query vectors. Then we apply the following steps:
• For each collection, each query is compared with all the documents, using the cosine
similarity measure. This yields a list giving the similarities of each query with all
documents of the collection.
• This list is ranked in decreasing order of similarity.
• The training data is formed from the top 5 documents of the list together with the
corresponding query.
• The keywords (terms) are automatically retrieved from the training data and used to
form a binary query vector.
• The query vector is adapted using the genetic approach.
3.3.3 THE GENETIC APPROACH
Once significant keywords are extracted from the training data (relevant and irrelevant
documents), weights are assigned to the keywords. The binary weights of the
keywords form a query vector. We applied the GA with two fitness functions to obtain an
optimal or near-optimal query vector, and we compared the results of the two GA
approaches with the classical IR system without GA. This is explained in the
following subsections.
I. Representation of the chromosomes
These chromosomes use a binary representation, and are converted to a real representation by
using a random function. A chromosome has as many genes (components) as there are terms
with non-zero weights in the query and the feedback documents. The set of terms contained in
these documents and the query is calculated, and the size of the chromosomes is equal to
the number of terms in that set. We obtain the query vector in binary representation and
apply the random function to convert the term weights to a real representation. Our GA
approach receives an initial population of chromosomes corresponding to the top 15 documents
retrieved by the classical IR system with respect to that query.
II. Fitness function
A fitness function is a performance measure or reward function that evaluates how good each
solution is. In our work, we used two GAs with two different fitness functions. (a) The
first GA system (GA1) uses the cosine similarity between the query vector and the
chromosomes of the population as its fitness function, given by the equation
$\frac{\sum_{i=1}^{t} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \cdot \sum_{i=1}^{t} y_i^2}}$ ...... (3.2)
where $x_i$ is the real-representation weight of term i in the chromosome, $y_i$ is the
real-representation weight of that term in the query vector, and t is the total number of terms
in the query vector, which is the same as in a given chromosome. The value of the cosine
similarity lies in the interval [0, 1], according to the similarity between a chromosome and
the query.
III. Selection
As the selection mechanism, the GA uses "simple random sampling". This consists of
constructing a roulette wheel with the same number of slots as there are individuals in the
population, in which the size of each slot is directly related to the individual's fitness value.
Hence, the best chromosomes will on average achieve more copies, and the worst fewer copies.
We also used the "elitism" strategy as a complement to the selection mechanism. After
generating the new population, if the best chromosome of the preceding generation is by
chance absent, the worst individual of the new population is withdrawn and replaced by that
chromosome.
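Roulette-wheel selection with elitism, as described above, might look like the following Java sketch. It is an illustration under stated assumptions, not the thesis code: the class and method names are hypothetical, fitness values are assumed to be non-negative, and a wheel whose slots are all zero falls back to a uniform choice.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only: names and fallback behaviour are assumptions. */
final class RouletteSelection {
    private RouletteSelection() {}

    /** Picks one index with probability proportional to its (non-negative) fitness. */
    static int spin(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) total += f;
        if (total <= 0.0) return rng.nextInt(fitness.length); // all-zero wheel: uniform choice
        double pointer = rng.nextDouble() * total;            // where the wheel stops
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];                         // slot size equals fitness
            if (pointer < cumulative) return i;
        }
        return fitness.length - 1;                            // guards against round-off
    }

    /** Elitism: if the previous best is absent from the new population, it replaces the worst member. */
    static void applyElitism(double[][] newPop, double[] newFitness,
                             double[] bestPrev, double bestPrevFitness) {
        int worst = 0;
        for (int i = 0; i < newPop.length; i++) {
            if (Arrays.equals(newPop[i], bestPrev)) return;   // best chromosome survived
            if (newFitness[i] < newFitness[worst]) worst = i;
        }
        newPop[worst] = bestPrev.clone();
        newFitness[worst] = bestPrevFitness;
    }
}
```

Calling spin once per slot of the next generation gives fitter chromosomes more copies on average, which is the behaviour the text describes.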
IV. Operators
In our GA approaches, we use two GA operators to produce offspring chromosomes:
• Crossover is the genetic operator that mixes two chromosomes together to form new
offspring. Crossover occurs only with crossover probability Pc. Chromosomes that are not
subjected to crossover remain unmodified. The intuition behind crossover is the
exploration of new solutions and the exploitation of old solutions. GAs construct a better
solution by mixing the good characteristics of chromosomes together. Chromosomes with
higher fitness have a greater chance of being selected than those with lower fitness, so good
solutions always survive to the next generation. We use a single-point crossover, which
exchanges the weights of a sub-vector between two chromosomes that are candidates for this
process.
• Mutation is the second operator used in our GA systems. Mutation involves the
modification of the gene values of a solution with some probability Pm. Changing
some bit values of a chromosome produces a different breed. The new
chromosome may be better or poorer than the old chromosome. If it is poorer than
the old chromosome, it is eliminated in the selection step. The objective of mutation is
to restore lost values and to explore the variety of data.
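The two operators can be sketched in Java as follows. This is a minimal illustration, not the thesis code: the class and method names are hypothetical, chromosomes are assumed to be binary int arrays of equal length, and mutation is assumed to flip each bit independently with probability Pm.

```java
import java.util.Random;

/** Illustrative sketch only: names and the bit-flip convention are assumptions. */
final class GeneticOperators {
    private GeneticOperators() {}

    /** With probability pc, swaps the tails of two parents after a random cut point. */
    static int[][] singlePointCrossover(int[] a, int[] b, double pc, Random rng) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("parents must have the same length");
        }
        int[] child1 = a.clone();
        int[] child2 = b.clone();
        if (a.length > 1 && rng.nextDouble() < pc) {
            int cut = 1 + rng.nextInt(a.length - 1); // cut point in 1..length-1
            for (int i = cut; i < a.length; i++) {
                child1[i] = b[i];                    // tail of the second parent
                child2[i] = a[i];                    // tail of the first parent
            }
        }
        return new int[][] {child1, child2};         // unmodified copies if no crossover
    }

    /** Flips each bit independently with probability pm and returns the mutated copy. */
    static int[] mutate(int[] chromosome, double pm, Random rng) {
        int[] out = chromosome.clone();
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < pm) out[i] = 1 - out[i];
        }
        return out;
    }
}
```

Both methods return new arrays, so the parents stay available for the elitism check in the selection step.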
Fig. 3.2 Proposed Architecture of Query Optimization Search System
[Figure components: Web Crawler; Document Database; Information Retrieval System; Keywords
collected; Initial population; Fitness function; Apply genetic operators (Single-point
Crossover); Genetic best combination terms]
3.3.4 GENETIC ALGORITHM STEPS
Fig. 3.3 Steps of Genetic Algorithm
3.4 EVALUATION OF QUERY OPTIMIZATION SEARCH SYSTEM
There are several ways to measure the quality of a QOSS, such as the system efficiency and
effectiveness, and several subjective aspects related to user satisfaction. Traditionally,
retrieval effectiveness (usually based on document relevance with respect to the user's
needs) is the most considered. There are different criteria to measure this aspect, with
precision and recall being the most used.
Precision (P) is the ratio between the relevant documents retrieved by the IR system in response
to a query and the total number of documents retrieved, whilst Recall (R) is the ratio between
the number of relevant documents retrieved and the total number of relevant documents for the
query existing in the database. The mathematical expression of each is as follows:
$P = \frac{\text{number of documents retrieved and relevant}}{\text{total retrieved}} = \frac{\sum_d r_d \cdot f_d}{\sum_d f_d}$ ...... (3.3)
$R = \frac{\text{number of documents retrieved and relevant}}{\text{total relevant in collection}} = \frac{\sum_d r_d \cdot f_d}{\sum_d r_d}$ ...... (3.4)
with $r_d \in \{0, 1\}$ being the relevance of document d for the user and $f_d \in \{0, 1\}$ being
the retrieval of document d in the processing of the current query. Notice that both measures
are defined in [0, 1], with 1 being the optimal value.
The evaluation function used here is the non-interpolated average precision, which is similar to
average precision but with the cut-off points equivalent to the training documents. In this
measure, the documents are simply ranked. Let $d_1, d_2, \ldots, d_{|D|}$ denote
the documents sorted by decreasing order of the values of the similarity measure function,
where |D| represents the number of training documents. The function r(d) gives the
relevance of a document d. It returns 1 if d is relevant, and 0 otherwise. The non-interpolated
average precision is defined as follows:
$\mathrm{AvgP} = \frac{1}{|D|} \sum_{i=1}^{|D|} r(d_i) \cdot \sum_{j=i}^{|D|} \frac{1}{j}$ ...... (3.5)
where $r(d_i)$ returns 1 if $d_i$ is relevant and 0 otherwise, and |D| represents the number
of documents.
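A small Java sketch of this evaluation function is given below. It is illustrative only: the class and method names are hypothetical, the input is assumed to be the relevance flags r(d_1), ..., r(d_|D|) in ranked order, and the inner sum is assumed to run from the document's own rank i up to |D|, which is what makes the score depend on the ranking.

```java
/** Illustrative sketch only: names and the j = i lower bound are assumptions. */
final class AveragePrecision {
    private AveragePrecision() {}

    /**
     * @param relevantByRank relevantByRank[i - 1] is true when the document at rank i is relevant
     * @return (1/|D|) * sum over ranks i of r(d_i) * sum_{j=i..|D|} 1/j
     */
    static double nonInterpolated(boolean[] relevantByRank) {
        int n = relevantByRank.length;
        if (n == 0) return 0.0;
        double tail = 0.0;   // running value of sum_{j=i..n} 1/j
        double total = 0.0;
        for (int i = n; i >= 1; i--) {  // walk the ranking from the bottom up
            tail += 1.0 / i;
            if (relevantByRank[i - 1]) total += tail;
        }
        return total / n;
    }
}
```

For two documents, a relevant document at rank 1 scores 0.75, while the same document at rank 2 scores 0.25, so rankings that place relevant documents earlier are rewarded.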
CHAPTER IV
RESULTS AND DISCUSSION
4.1 QUERY OPTIMIZATION SEARCH SYSTEM EXAMPLE
4.1.1 DATABASE contains these documents
D1.txt: Shipment of gold damaged in a fire
D2.txt: Delivery of silver arrived in a silver truck
D3.txt: Shipment of gold arrived in a truck
D4.txt: Shipment of coal arrived in a truck
Fig. 4.1 Documents
The documents undergo preprocessing, and an index is built in the vector space model.
Terms     Q  D1  D2  D3  D4  df  D/df      IDF     Wq      Wd1     Wd2     Wd3     Wd4
Arrived   0  0   1   1   1   3   4/3=1.33  0.1238  0       0       0.1238  0.1238  0.1238
Coal      0  0   0   0   1   1   4/1=4     0.6020  0       0       0       0       0.6020
Damaged   0  1   0   0   0   1   4/1=4     0.6020  0       0.6020  0       0       0
Delivery  0  0   1   0   0   1   4/1=4     0.6020  0       0       0.6020  0       0
Fire      0  1   0   0   0   1   4/1=4     0.6020  0       0.6020  0       0       0
Gold      1  1   0   1   0   2   4/2=2     0.3010  0.3010  0.3010  0       0.3010  0
Shipment  0  1   0   1   1   3   4/3=1.33  0.1238  0       0.1238  0       0.1238  0.1238
Silver    1  0   1   0   0   1   4/1=4     0.6020  0.6020  0       0.6020  0       0
Truck     1  0   1   1   1   3   4/3=1.33  0.1238  0.1238  0       0.1238  0.1238  0.1238
Table 4.1 Vector Space Index
A document vector (Doc) with n keywords and a query vector with m query terms can be
represented as
Doc = (term_1, term_2, term_3, ..., term_n)
Query = (qterm_1, qterm_2, qterm_3, ..., qterm_m)
We use binary term vectors, so each term_i (or qterm_j) is either 0 or 1. term_i is set to
zero when term_i is not present in the document and to one when term_i is present in the
document.
For example, a user enters a query into our system that retrieves 4 documents. These
documents are
D1 = {shipment, gold, damaged, fire}
D2 = {delivery, silver, arrived, truck}
D3 = {shipment, gold, arrived, truck}
D4 = {shipment, coal, arrived, truck}
All keywords of these documents can be arranged in ascending order as
arrived, coal, damaged, delivery, fire, gold, shipment, silver, truck
and encoded in the chromosome representation as
D1 = 0 0 1 0 1 1 1 0 0
D2 = 1 0 0 1 0 0 0 1 1
D3 = 1 0 0 0 0 1 1 0 1
D4 = 1 1 0 0 0 0 1 0 1
Q = 0 0 0 0 0 1 0 1 1
These chromosomes are called the initial population and are fed into the genetic operator
process. The length of a chromosome depends on the number of keywords of the documents
retrieved by the user query. In our example, the length of each chromosome is 9 bits.
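The encoding in this example can be reproduced with a short Java sketch. It is illustrative only: the class and method names are hypothetical, and the vocabulary is assumed to be the alphabetically sorted set of all keywords of the retrieved documents.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Illustrative sketch only: names are hypothetical, not from the thesis code. */
final class ChromosomeEncoder {
    private ChromosomeEncoder() {}

    /** Builds the sorted, duplicate-free keyword list over all documents. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();      // TreeSet keeps ascending order
        for (List<String> doc : docs) terms.addAll(doc);
        return new ArrayList<>(terms);
    }

    /** Sets bit i to 1 when the i-th vocabulary term occurs among the given keywords. */
    static int[] encode(List<String> vocabulary, Collection<String> keywords) {
        int[] bits = new int[vocabulary.size()];
        for (int i = 0; i < bits.length; i++) {
            bits[i] = keywords.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return bits;
    }
}
```

Encoding the four example documents and the query terms gold, silver and truck with this sketch yields 9-bit chromosomes over the same sorted keyword list.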
Compute the vector lengths from the weights in Table 4.1:
$|D_1| = \sqrt{0.6020^2 + 0.6020^2 + 0.3010^2 + 0.1238^2} = \sqrt{0.8307} = 0.9114$
$|D_2| = \sqrt{0.1238^2 + 0.6020^2 + 0.6020^2 + 0.1238^2} = \sqrt{0.7555} = 0.8692$
$|D_3| = \sqrt{0.1238^2 + 0.3010^2 + 0.1238^2 + 0.1238^2} = \sqrt{0.1366} = 0.3696$
$|D_4| = \sqrt{0.1238^2 + 0.6020^2 + 0.1238^2 + 0.1238^2} = \sqrt{0.4084} = 0.6390$
$|D_i| = \sqrt{\sum_j w_{i,j}^2}$ ...... (4.1)
$|Q| = \sqrt{0.3010^2 + 0.6020^2 + 0.1238^2} = \sqrt{0.4683} = 0.6843$
$|Q| = \sqrt{\sum_j w_{Q,j}^2}$ ...... (4.2)
Compute all dot products (zero products ignored):
$Q \cdot D_1 = 0.3010 \times 0.3010 = 0.0906$
$Q \cdot D_2 = 0.6020 \times 0.6020 + 0.1238 \times 0.1238 = 0.3777$
$Q \cdot D_3 = 0.3010 \times 0.3010 + 0.1238 \times 0.1238 = 0.1059$
$Q \cdot D_4 = 0.1238 \times 0.1238 = 0.0153$
$Q \cdot D_i = \sum_j w_{Q,j}\, w_{i,j}$ ...... (4.3)
Calculate the similarity values:
$\cos\theta_{D_1} = \frac{Q \cdot D_1}{|Q|\,|D_1|} = \frac{0.0906}{0.6843 \times 0.9114} = 0.1453$
$\cos\theta_{D_2} = \frac{Q \cdot D_2}{|Q|\,|D_2|} = \frac{0.3777}{0.6843 \times 0.8692} = 0.6350$
$\cos\theta_{D_3} = \frac{Q \cdot D_3}{|Q|\,|D_3|} = \frac{0.1059}{0.6843 \times 0.3696} = 0.4187$
$\cos\theta_{D_4} = \frac{Q \cdot D_4}{|Q|\,|D_4|} = \frac{0.0153}{0.6843 \times 0.6390} = 0.0350$
$\cos\theta_{D_i} = \mathrm{Sim}(Q, D_i)$
$\mathrm{Sim}(Q, D_i) = \frac{\sum_j w_{Q,j}\, w_{i,j}}{\sqrt{\sum_j w_{Q,j}^2}\,\sqrt{\sum_j w_{i,j}^2}}$
4.1.2 FITNESS EVALUATION
A fitness function is a performance measure or reward function that evaluates how good each
solution is. The information retrieval problem is how to retrieve the documents the user
requires. We use the fitness function (4.4) to calculate the similarity between
document and query.
$\cos\theta_{D_i} = \mathrm{Sim}(Q, D_i)$
$\mathrm{Sim}(Q, D_i) = \frac{\sum_j w_{Q,j}\, w_{i,j}}{\sqrt{\sum_j w_{Q,j}^2}\,\sqrt{\sum_j w_{i,j}^2}}$ ...... (4.4)
The result of this fitness function lies in the interval 0 to 1. A value of 1.0 means the
document and the query are identical. Values near 1.0 mean the document and the query are more
relevant to each other, and values near 0.0 mean they are less relevant. The values computed by
the fitness function are called "fitness".
4.1.3 SELECTION
chromosomes, give the different breeds. Chromosomes may be better or poorer than the old
chromosomes. If they are poorer than the old chromosomes, they are eliminated in the selection
step. The objective of mutation is to restore lost values and to explore the variety of data.
For example, randomly mutate the chromosome at position 6:
0 0 0 1 0 0 0 1 1
Result: 0 0 0 1 0 1 0 1 1
4.2 PROCESS OF OUR SYSTEM
1. The user enters a query into our system.
2. Match the keywords from the user query against the list of keywords.
3. Encode the documents retrieved by the user query as chromosomes (initial population).
4. Feed the population into the genetic operator process: selection, crossover, and
mutation.
5. Repeat step 4 until the maximum generation is reached. This yields an optimized query
chromosome for document retrieval.
6. Decode the optimized query chromosome into a query and retrieve documents from the database.
4.3 TEST CASE FORMULATION
This experiment tests queries with the cosine coefficient as the fitness function. The fitness
function is tested with a set of parameters, probability of crossover (Pc = 0.8) and probability
of mutation (Pm = 0.01, 0.10, 0.30), to compare the efficiency of the retrieval system.
Information retrieval efficiency is measured by precision P, recall R, and the test accuracy F1.
$P = \frac{|\text{relevant documents} \cap \text{documents retrieved}|}{|\text{documents retrieved}|}$ ...... (4.5)
$R = \frac{|\text{relevant documents} \cap \text{documents retrieved}|}{|\text{relevant documents}|}$ ...... (4.6)
$F_1 = \frac{2PR}{P + R}$ ...... (4.7)
Result (Percentage)
Total Docs     Technique    P        R        F1
100            Without GA   100%     100%     100%
               With GA      100%     100%     100%
200            Without GA   97.92%   51.02%   67.09%
               With GA      96.00%   96.15%   96.07%
300            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
400            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
500            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
Average (∑/5)  Without GA   94.88%   61.43%   73.33%
               With GA      90.63%   91.73%   91.17%
Table 4.2 Values of precision, recall and F1.
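The three measures of Section 4.3 can be computed with a short Java sketch like the one below. It is an illustration, not the thesis code: the class and method names are hypothetical, documents are assumed to be identified by string IDs, and an empty denominator is assumed to give 0.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: names and the zero-denominator convention are assumptions. */
final class RetrievalMetrics {
    private RetrievalMetrics() {}

    /** Fraction of the retrieved documents that are relevant. */
    static double precision(Set<String> relevant, Set<String> retrieved) {
        return retrieved.isEmpty() ? 0.0 : (double) overlap(relevant, retrieved) / retrieved.size();
    }

    /** Fraction of the relevant documents that were retrieved. */
    static double recall(Set<String> relevant, Set<String> retrieved) {
        return relevant.isEmpty() ? 0.0 : (double) overlap(relevant, retrieved) / relevant.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    /** Size of the intersection of the two document sets. */
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```

With four relevant documents and three retrieved documents of which two are relevant, this gives P = 2/3, R = 1/2 and F1 = 4/7.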
[Bar chart: Precision, Recall and F1 in percent, Without GA vs. With GA]
Fig. 4.2 The average percentage result for P, R and F1.
Fig. 4.2 shows the average results of precision, recall and F1 for e-business topics. From the
results, the precision with GA (90.63%) is lower than the precision without GA (94.88%). This
means that only some of the retrieved documents are relevant to the user's search. However, the
recall with GA is 91.73%, compared to 61.43% without GA. This means that 91.73% of the
relevant documents are successfully found by the system for the query selected by the user. The
F1 with GA (91.17%) is also higher than the result without GA (73.33%). From this result,
we believe that document search based on the GA has a higher accuracy than
search without GA.
[Chart: precision in percent vs. total documents (100 to 500), Without GA and With GA]
Fig. 4.3 Precision
[Chart: recall in percent vs. total documents (100 to 500), Without GA and With GA]
Fig. 4.4 Recall
[Chart: F1 in percent vs. total documents (100 to 500), Without GA and With GA]
Fig. 4.5 F1
As shown in Fig. 4.3, the precision values with GA and without GA decrease as the total
number of documents increases. With GA, this is caused by keyword expansion in the GA process,
which produces results that are less accurate with respect to the user's search but still
relevant according to the system's search. However, the recall and F1 values with GA, shown in
Fig. 4.4 and Fig. 4.5, are higher than the recall and F1 values without GA.
4.4 SCREENSHOTS
Fig. 4.6 Web Crawler
Fig. 4.7 Indexing in vector space model and searching interface
CHAPTER V
CONCLUSION
We have used a crawler, which retrieves documents from e-business web pages. These web
pages then go through the information retrieval process.
Finally, the proposed Query Optimization Search System is a two-stage approach:
• The first stage uses a genetic algorithm to obtain the best combination of terms.
• The second stage uses the output of the first stage to retrieve more
relevant results.
Thus a novel two-stage approach to document retrieval using a Genetic Algorithm has been
proposed. The proposed information retrieval system is more efficient within a specific
domain, as it retrieves more relevant results. This has been verified using the evaluation
measures precision and recall.
CHAPTER VI
FUTURE SCOPE OF WORK
We found that the search process of the e-business website is optimized by using a genetic
algorithm.
Furthermore, a feedback mechanism can pass the user's suggestions about the retrieved
documents to the search system, which leads to a new query generated by the genetic algorithm.
In the new search stage, more relevant documents are given to the user.
The future research plan is to improve the performance of user search activities such that
user profiles can be learned automatically.
We believe that the experimental results are interesting and useful for related research, and
that the research issue identified should be further studied in other collaborative environments
such as search engines and the search systems in personal computers.
REFERENCES
[1] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to
Information Retrieval, Cambridge University Press, 2008. http://informationretrieval.org/
[2] David E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley, 1989.
[3] H. Chen, "Machine learning for information retrieval: neural networks, symbolic learning,
and genetic algorithms", Journal of the American Society for Information Science, 46(3),
1995, pp. 194-216.
[4] Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali and Osman A. Sadek,
"Using Genetic Algorithm to Improve Information Retrieval Systems", World Academy of
Science, Engineering and Technology 17.
[5] Eman Al Mashagba, Feras Al Mashagba and Mohammad Othman Nassar, "Query
Optimization Using Genetic Algorithms in the Vector Space Model", IJCSI International
Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, September 2011, ISSN (Online)
1694-0814, www.IJCSI.org
[6] Cristina López-Pujalte, Vicente P. Guerrero-Bote and Félix de Moya-Anegón, "Order-Based
Fitness Functions for Genetic Algorithms Applied to Relevance Feedback", Journal of
the American Society for Information Science and Technology, 54(2):152-160, 2003.
[7] A. S. Siva Sathya and B. Philomina Simon, "A Document Retrieval System with Combination
Terms Using Genetic Algorithm", International Journal of Computer and Electrical
Engineering, Vol. 2, No. 1, February 2010, 1793-8163.
[8] Detelin Luchev and Abdelmgeid A. Aly, "Applying Genetic Algorithm in Query Improvement
Problem", International Journal "Information Technologies and Knowledge", Vol. 1, 2007.
[9] Philomina Simon, "Two Stage Approach to Document Retrieval using Genetic
Algorithm", International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009.
[10] Abdelmgeid A. Aly, "Enhancing Information Retrieval by using Evolution
Strategies", Information Theories and Applications, Vol. 15, 2008.
[11] Andrew T., "An Artificial Intelligence Approach to Information Retrieval",
Information Processing and Management, 40(4):619-632, 2004.
[12] D. Vrajitoru, "Crossover improvement for the genetic algorithm in information retrieval",
Information Processing & Management, 34(4), pp. 405-415, 1998.
[13] Ilmério R. Silva, João Nunes Souza and Karina S. Santos, "Dependence among Terms in
Vector Space Model", Proceedings of the International Database Engineering and
Applications Symposium (IDEAS'04), IEEE, 2004.
[14] Hai-yan Kang et al., "Research on Natural Language IR System based on Genetic
Algorithm and VSM", Proceedings of the Third International Conference on Machine Learning
and Cybernetics, Shanghai, 26-29 August 2004.
[15] Shaozi Li, Changfe Zhou and Huowang Chen, "Web Document Retrieval Based on
Multi-agent", Proceedings of the International Conference on Computer Supported
Cooperative Work in Design.
[16] Amir Karshenas and Kamil Dimililer, "PIRS: An Information Retrieval System based on the
Vector Space Model", IEEE.
[17] Louis S. Wang, "Relevance Weighting of Multi-Term Queries for Vector Space Model",
IEEE, 2009.
APPENDIX
6eb Cra%ler %ritten in Hava
8sage +rom command line
:ava 6ebCra%ler 8> 'J
6here 8> is the url to start the cra%l, and ' (optional) is the ma$imum number of pages to
do%nload!
import :ava!te$t!Z;
import :ava!util!Z;
import :ava!net!Z;
import :ava!io!Z;
public class 6ebCra%ler W
public static final int EC@>I?IT U 3D; AA bsolute ma$ pages
public static final boolean 1EB80 U false;
public static final tring 1I>>#6 U "1isallo%";
public static final int ?ILE U 3DDDD; AA ?a$ si&e of file
AA 8>s to be searched
9ector ne%8>s;
AA ]no%n 8>s
@ashtable no%n 8>s;
4*
8/18/2019 Thesis Final on project search
http://slidepdf.com/reader/full/thesis-final-on-project-search 50/68
AA ma$ number of pages to do%nload
Int ma$ 7ages;
AA initiali&es data structures! argv is the command line arguments!
public void initiali&e(tringJ argv) W
8> url;
]no%n 8>s U ne% @ashtable();
ne%8>s U ne% 9ector();
try W url U ne% 8>(argvDJ); X
catch (?alformed8>E$ception e) W
ystem!out!println("Invalid starting 8> " K argvDJ);
return;
X
no%n8>s!put(url,ne% Integer(2));
ne%8>s!addElement(url);
ystem!out!println("tarting search Initial 8> " K url!totring());
?a$ 7ages U EC@>I?IT;
if (argv!length 2) W
int i7ages U Integer!parseInt(argv2J);
if (i7agesma$7ages) ma$7ages U i7ages; X
ystem!out!println("?a$imum number of pages" K ma$7ages);
AZBehind a fire%all set your pro$y and port hereZA
5+
8/18/2019 Thesis Final on project search
http://slidepdf.com/reader/full/thesis-final-on-project-search 51/68
        Properties props = new Properties(System.getProperties());
        props.put("http.proxySet", "true");
        props.put("http.proxyHost", "webcache-cup");
        props.put("http.proxyPort", "8080");
        Properties newprops = new Properties(props);
        System.setProperties(newprops);
    }
    // Check that the robot exclusion protocol does not disallow
    // downloading url.
    public boolean robotSafe(URL url) {
        String strHost = url.getHost();
        // form URL of the robots.txt file
        String strRobot = "http://" + strHost + "/robots.txt";
        URL urlRobot;
        try { urlRobot = new URL(strRobot);
        } catch (MalformedURLException e) {
            // something weird is happening, so don't trust it
            return false;
        }
        if (DEBUG) System.out.println("Checking robot protocol " + urlRobot.toString());
        int index = 0;
        while ((index = strCommands.indexOf(DISALLOW, index)) != -1) {
            index += DISALLOW.length();
            String strPath = strCommands.substring(index);
            StringTokenizer st = new StringTokenizer(strPath);
            if (!st.hasMoreTokens())
                break;
            String strBadPath = st.nextToken();
            // if the URL starts with a disallowed path, it is not safe
            if (strURL.indexOf(strBadPath) == 0)
                return false;
        }
        return true;
    }
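The Disallow test performed by robotSafe can be tried in isolation. The following is a minimal sketch under stated assumptions, not the crawler's own code: the class RobotsRules and its method names are hypothetical, the robots.txt body is passed in as a plain string rather than downloaded, and, like the listing above, it ignores User-agent sections and treats every Disallow line as binding.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the WebCrawler listing. */
class RobotsRules {
    private static final String DISALLOW = "Disallow:";

    /** Collects every non-empty path that follows a "Disallow:" directive. */
    static List<String> disallowedPaths(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Case-insensitive check that the line begins with "Disallow:"
            if (trimmed.regionMatches(true, 0, DISALLOW, 0, DISALLOW.length())) {
                String path = trimmed.substring(DISALLOW.length()).trim();
                if (!path.isEmpty()) {
                    paths.add(path);
                }
            }
        }
        return paths;
    }

    /** A URL path is safe when it starts with none of the disallowed prefixes. */
    static boolean isSafe(String urlPath, String robotsTxt) {
        for (String bad : disallowedPaths(robotsTxt)) {
            if (urlPath.startsWith(bad)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        System.out.println(isSafe("/index.html", robots));
        System.out.println(isSafe("/private/a.html", robots));
    }
}
```

Splitting by line keeps each directive separate, so an empty Disallow line (which permits everything) cannot swallow the first token of the next line.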
    // adds new URL to the queue. Accept only new URLs that end in
    // htm or html. oldURL is the context, newUrlString is the link
    // (either an absolute or a relative URL).
    public void addnewurl(URL oldURL, String newUrlString)
    { URL url;
        if (DEBUG) System.out.println("URL String " + newUrlString);
        try { url = new URL(oldURL, newUrlString);
            if (!knownURLs.containsKey(url)) {
                String filename = url.getFile();
                int iSuffix = filename.lastIndexOf("htm");
                if ((iSuffix == filename.length() - 3) ||
                    (iSuffix == filename.length() - 4)) {
                    knownURLs.put(url, new Integer(1));
                    newURLs.addElement(url);
                    System.out.println("Found new URL " + url.toString());
                } } }
        catch (MalformedURLException e) { return; }
    }
    // Download contents of URL
    public String getpage(URL url)
    { try {
            // try opening the URL
            URLConnection urlConnection = url.openConnection();
            System.out.println("Downloading " + url.toString());
            urlConnection.setAllowUserInteraction(false);
            InputStream urlStream = url.openStream();
            // search the input stream for links
            // first, read in the entire URL
            byte b[] = new byte[1000];
            int numRead = urlStream.read(b);
            String content = new String(b, 0, numRead);
            while ((numRead != -1) && (content.length() < MAXSIZE)) {
                numRead = urlStream.read(b);
                if (numRead != -1) {
                    String newContent = new String(b, 0, numRead);
                    content += newContent;
                }
            } return content;
        } catch (IOException e) {
            System.out.println("ERROR: couldn't open URL ");
            return "";
        }
    }
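The size-capped read loop in getpage can be exercised without a network connection. The sketch below is illustrative only: the class BoundedRead and the method readUpTo are hypothetical names, the bytes are decoded as ISO-8859-1 (an assumption; the listing uses the platform default), and an empty stream yields an empty string instead of reaching the String constructor with a length of -1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the WebCrawler listing. */
class BoundedRead {
    /**
     * Reads 1000-byte chunks until the stream ends or at least maxSize
     * bytes have been collected. Returns "" on an I/O error, as getpage does.
     */
    static String readUpTo(InputStream in, int maxSize) {
        byte[] buffer = new byte[1000];
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try {
            int numRead;
            while (collected.size() < maxSize && (numRead = in.read(buffer)) != -1) {
                collected.write(buffer, 0, numRead);
            }
        } catch (IOException e) {
            return "";
        }
        // One byte maps to one char under ISO-8859-1 (assumed encoding)
        return new String(collected.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

As in the listing, the cap is checked between chunks, so the result may exceed maxSize by up to one buffer length. Collecting bytes first and decoding once also avoids repeated string concatenation.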
    // Go through page finding links to URLs. A link is signalled
    // by <a href=" ... It ends with a close angle bracket, preceded
    // by a close quote, possibly preceded by a hatch mark (marking a
    // fragment, an internal page marker)
    public void processpage(URL url, String page)
    { String lcPage = page.toLowerCase(); // Page in lower case
        int index = 0; // position in page
        int iEndAngle, ihref, iURL, iCloseQuote, iHatchMark, iEnd;
        while ((index = lcPage.indexOf("<a", index)) != -1) {
            iEndAngle = lcPage.indexOf(">", index);
            ihref = lcPage.indexOf("href", index);
            if (ihref != -1) {
                iURL = lcPage.indexOf("\"", ihref) + 1;
                if ((iURL != -1) && (iEndAngle != -1) && (iURL < iEndAngle))
                { iCloseQuote = lcPage.indexOf("\"", iURL);
                    iHatchMark = lcPage.indexOf("#", iURL);
                    if ((iCloseQuote != -1) && (iCloseQuote < iEndAngle)) {
                        iEnd = iCloseQuote;
                        if ((iHatchMark != -1) && (iHatchMark < iCloseQuote))
                            iEnd = iHatchMark;
                        String newUrlString = page.substring(iURL, iEnd);
                        addnewurl(url, newUrlString);
                    } } }
            index = iEndAngle;
        }
    }
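The link scan described in the comment above (find an anchor tag, take the quoted href value, cut it at a hatch mark) can be sketched as a standalone method. This is a simplified illustration, not the crawler's code: LinkScanner and extractLinks are hypothetical names, the input is assumed to be plain ASCII HTML with double-quoted attributes, and any tag whose name merely begins with "a" is treated as an anchor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the WebCrawler listing. */
class LinkScanner {
    /** Returns the href values of anchor tags, with any #fragment removed. */
    static List<String> extractLinks(String page) {
        List<String> links = new ArrayList<>();
        String lc = page.toLowerCase(); // search case-insensitively
        int index = 0;
        while ((index = lc.indexOf("<a", index)) != -1) {
            int endAngle = lc.indexOf('>', index);
            if (endAngle == -1) {
                break; // unterminated tag: stop scanning
            }
            int href = lc.indexOf("href", index);
            if (href != -1 && href < endAngle) {
                int openQuote = lc.indexOf('"', href);
                if (openQuote != -1 && openQuote < endAngle) {
                    int start = openQuote + 1;
                    int closeQuote = lc.indexOf('"', start);
                    if (closeQuote != -1 && closeQuote < endAngle) {
                        int end = closeQuote;
                        int hatch = lc.indexOf('#', start);
                        if (hatch != -1 && hatch < closeQuote) {
                            end = hatch; // drop the internal page marker
                        }
                        // Slice the original page so the URL keeps its case
                        links.add(page.substring(start, end));
                    }
                }
            }
            index = endAngle;
        }
        return links;
    }
}
```

For instance, `extractLinks("<A HREF=\"page.html#top\">x</A>")` yields a single entry, `page.html`. Each extracted string would then be resolved against the page URL, as addnewurl does with `new URL(oldURL, newUrlString)`.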
    // Top-level procedure. Keep popping a url off newURLs, download
    // it, and accumulate new URLs
    public void run(String[] argv)
    { initialize(argv);
        for (int i = 0; i < maxPages; i++) {
            URL url = (URL) newURLs.elementAt(0);
            newURLs.removeElementAt(0);
            if (DEBUG) System.out.println("Searching " + url.toString());
            if (robotSafe(url)) {
package index;
import generic.Tuple2;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.Vector;
/**
 * The inverted index much more efficiently stores the indexed data for later
 * retrieval. This forms the centre of the project and should be the only class
 * that needs to be instantiated. All useful functions can be accessed through this
 * class.
        for (int i = 0; i < threads; i++)
            this.threads[i] = new IndexThread(this, i);
    }
    /**
     * Query for a list of relevant files. If a term appears in a query but not in
     * the document collection it will be ignored.
     * @param s the query string
     * @param maxresults the maximum number of results to return
     * @return a list of files matching the query
     */
    public SortedSet<QueryResult> query(String s, int maxresults) {
        // Since the IndexThread converts every term to lower case so shall we
        s = s.toLowerCase();
        // Create a list of files to return
        SortedSet<QueryResult> retval = new TreeSet<QueryResult>();
        // Split the terms of the query by the non-word regular expression class
        String[] split = s.split("\\W+");
        // This implementation considers the query document vector a binary vector,
        // in other words - duplicates are not allowed
        Set<String> terms = new HashSet<String>();
        for (String str : split)
            terms.add(str);
        // For every document in the collection calculate the similarity coefficient
        for (int i = 0; i < this.docTab.size(); i++) {
            double sc = 0.0;
            for (String term : terms) {
                TermData td = index.get(term);
                if (td != null)
                    sc += td.getIDF() * td.getIDF() * td.getFreq(i);
            }
            File f = this.docTab.get(i).getFile();
            if (sc > 0)
                retval.add(new QueryResult(sc, f));
            if (retval.size() > maxresults)
                retval.remove(retval.last());
        }
        return retval;
    }
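The similarity coefficient computed by query sums idf × idf × term frequency over the distinct query terms found in each document. The sketch below reproduces that sum on an in-memory collection. It is an illustration under assumptions: TfIdfSketch and score are hypothetical names, documents are plain strings, and idf is taken as log10(N / df), since the TermData class that defines getIDF is not shown in this listing. Ranking and the maxresults cut-off are omitted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of the inverted index listing. */
class TfIdfSketch {
    /** score[d] = sum over distinct query terms t of idf(t)^2 * tf(t, d). */
    static double[] score(List<String> docs, String query) {
        final int n = docs.size();
        // term -> per-document term frequency (a tiny in-memory posting table)
        Map<String, int[]> tf = new HashMap<>();
        for (int d = 0; d < n; d++) {
            for (String term : docs.get(d).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                tf.computeIfAbsent(term, k -> new int[n])[d]++;
            }
        }
        // Binary query vector: duplicate query terms count once
        Set<String> terms = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        double[] scores = new double[n];
        for (String term : terms) {
            int[] freq = tf.get(term);
            if (freq == null) {
                continue; // term absent from the collection: ignored
            }
            int df = 0;
            for (int f : freq) {
                if (f > 0) {
                    df++;
                }
            }
            double idf = Math.log10((double) n / df); // assumed idf definition
            for (int d = 0; d < n; d++) {
                scores[d] += idf * idf * freq[d];
            }
        }
        return scores;
    }
}
```

Under this idf definition a term that occurs in every document contributes nothing, which is why common words do not influence the ranking.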
    /**
     * Forces the index to scan all of its indexed files. If some files have been
     * modified or removed then adjust the index appropriately.
     */
    public void forceUpdate() {
        // TODO Implement
    }
    /**
     * Adds a file to the queue for indexing
     * @param f the file to be indexed
     * @throws IllegalArgumentException the file is not readable
     * @throws FileNotFoundException the file is not found
     */
    public void index(File f) throws IllegalArgumentException, FileNotFoundException {
        // Make sure this file is not part of the collection already
        for (Document d : this.docTab) {
            if (d.getFile().equals(f))
                return;
        for (File doc : f)
            this.index(doc);
    }
    /**
     * <p>For internal use, notifies completion of indexing.</p>
     * <p>Must be synchronized as several threads may access this
     * function concurrently. Since Java collections objects should
     * not be accessed concurrently synchronized behaviour is required.</p>
     * @param t the thread that has completed
     */
    public synchronized void notifyCompletion(IndexThread t) {
        // Retrieve the values of this thread
        File f = t.getFile();
        Map<String, Integer> freq = t.getTermFrequency();
        // Add the document to the document table
        try {
            this.docTab.add(new Document(f));
        } catch (Exception e) {
            System.err.println("File'" + t.getFile().getName() + "' - " + e.getMessage());
            e.printStackTrace();
            return;
        }
        // Update the collection size so that idf can be calculated correctly
        for (TermData td : this.index.values())
            td.setDocumentCollectionSize(this.docTab.size());
        // Fire off collection size listeners
        this.fireCollectionChangeEvent();
        // Retrieve the document identifier
        int docId = this.docTab.size() - 1;
        // Merge the terms into the posting list
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            // Get the term from the index map
            TermData td = this.index.get(e.getKey());
            // If the term was not found create and add it to the index
            if (td == null) {
                td = new TermData(this.docTab.size());
                this.index.put(e.getKey(), td);
                this.fireTermChangeEvent();
            }
            // Add the frequency to the posting list
            td.addFrequency(e.getValue(), docId);
        }
        // Set this thread to idle
        t.setIdle();
        // If the queue is not empty poll it and start indexing again.
        // Otherwise, check the first idle pointer and set the thread to idle
        if (this.indexQueue.size() > 0) {
            try {
                t.index(indexQueue.poll());
                this.fireQueueChangeEvent();
    public void removeIndexListener(IndexListener l) {
        this.listeners.remove(l);
    }
    /** Convenience function for firing off collection size change events */
    private void fireCollectionChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.DOCUMENTCOLLECTION, this.size());
    }
    /** Convenience function for firing off queue size change events */
    private void fireQueueChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.FILEQUEUE, this.indexQueue.size());
    }
    /** Convenience function for firing off term size change events */
    private void fireTermChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.TERMCOLLECTION, this.index.size());
    }
    /** Returns the number of files indexed */
    public int size() {
        return docTab.size();
    }
    /** Returns the number of terms globally in the collection */
    public int termSize() {
        return this.index.size();
    }
    /** Returns the number of indexing threads this inverted index uses */
    public int getNumberThreads() {
        return this.threads.length;
    }
    /** Returns the current progress of each thread and the file they are indexing as a map. */
    public Map<Integer, Tuple2<Float, File>> getThreadMap() {
        Map<Integer, Tuple2<Float, File>> result = new HashMap<Integer, Tuple2<Float, File>>();
        for (int i = 0; i < this.threads.length; i++) {
            Tuple2<Float, File> p = new Tuple2<Float, File>();
            p.first = threads[i].progress();
            p.second = threads[i].getFile();
            result.put(i, p);
        }
        return result;
    }
    @Override
    public String toString() {
        StringBuilder b = new StringBuilder();
        b.append("Index\n");
        for (Map.Entry<String, TermData> e : this.index.entrySet()) {