Quad Search: A novel metasearch engine
(http://cheetah.csd.auth.gr/~lakritid)

Leonidas Akritidis (1), George Voutsakelis (2), Dimitrios Katsaros (1,2), Panayiotis Bozanis (2)

(1) Data Engineering Lab, Dept. of Informatics, Aristotle Univ., Thessaloniki, Hellas
(2) Computer & Communication Engineering Dept., Univ. of Thessaly, Volos, Hellas

11th Panhellenic Conference of Informatics, Patras, Hellas, 18-20/05/2007



Introduction

Single Search Engines

• Maintenance of a document database
• Low Web coverage
• Medium scalability
• Paid listings

Metasearch Engines

• Effortless invocation of multiple search engines
• No document database
• Increased Web coverage
• Improved retrieval effectiveness

Outline: Introduction, Metasearch Engines, Rank Aggregation, Rank Aggregation Methods, KE Method, Antispam Version

Metasearch Engines

Metasearch engines use the document databases that the component search engines maintain.

[Diagram: User, Metasearch Engine, Component Engines 1, 2, ..., N, and Document Databases 1, 2, ..., N]


Rank Aggregation

What is Rank Aggregation?

Rank aggregation merges the ranked lists returned by several engines into a single ranked list.

[Diagram: the ranked lists (A, B, D, C, F, E), (B, D, C, A) and (B, D, C, A, F, E) pass through Rank Aggregation, which outputs (B, D, C, A, F, E)]

Rank Aggregation Methods

• Unweighted Borda Count
• Spearman's Footrule
• Kendall's Tau
• Markov Chains
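To make the first method in this list concrete, here is a minimal sketch of an unweighted Borda count. The scoring convention is an assumption, since the slides do not define one: an item at 0-based position p earns N - p points, where N is the number of distinct items over all lists, and an item missing from a list earns nothing from it. The function name `borda_count` and the toy lists are illustrative only.

```python
from collections import defaultdict


def borda_count(rankings):
    """Merge ranked lists with an unweighted Borda count.

    Assumption: an item at 0-based position p in a list earns N - p points,
    where N is the number of distinct items over all lists; an item that is
    missing from a list earns no points from that list.
    """
    universe = {item for ranking in rankings for item in ranking}
    n_items = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n_items - position
    # Highest total first; ties are broken alphabetically for a stable output.
    return sorted(universe, key=lambda item: (-scores[item], item))


merged = borda_count([["A", "B", "C"], ["B", "A", "C"], ["B", "C"]])
print(merged)  # ['B', 'A', 'C']
```

Here B collects 8 points, A collects 5 and C collects 4, so B leads the merged list.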


KE Method

Description

Each result is called a candidate.

Each candidate receives a score (weight), according to the formula below:

\[
w = \frac{\sum_{i=1}^{m} r(i)}{n^{m} \left( \frac{k}{10} + 1 \right)^{n}}
\]

• r(i): The candidate's rank in the i-th engine
• n: The number of the candidate's appearances
• m: The number of the invoked search engines
• k: The length of the top-k list
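The weight formula can be sketched in a few lines of Python. Two points are assumptions rather than statements from the slides: the sum of r(i) runs only over the engines that actually returned the candidate, and a lower weight means a better final position (the numerator shrinks with better ranks and the denominator grows with more appearances). The function name `ke_weight` is illustrative.

```python
def ke_weight(ranks, m, k):
    """KE weight of one candidate.

    ranks: the candidate's 1-based rank in each engine that returned it
           (assumption: engines that missed the candidate add nothing)
    m:     number of invoked search engines
    k:     length of each top-k list
    """
    n = len(ranks)  # number of appearances of the candidate
    if n == 0:
        raise ValueError("a candidate must appear in at least one list")
    return sum(ranks) / (n ** m * (k / 10 + 1) ** n)


# Four engines, top-10 lists.
everywhere = ke_weight([1, 2, 1, 3], m=4, k=10)  # returned by all four engines
once = ke_weight([1], m=4, k=10)                 # ranked first by one engine only
print(everywhere, once)  # 0.001708984375 0.5
```

Under the lower-is-better reading, the candidate found by all four engines beats the one that a single engine ranked first.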

Antispam Version of the KE Method

We say that a search engine has been spammed by a page when it ranks the page too highly with respect to the other pages, according to the view of a typical user.

We try to constrain this phenomenon by proposing the antispam version of the KE Method, which is best described by the following pseudocode:

1. Find which items appear in more than half of the engines' result pages (let the number of these items be c)
2. Apply the KE Method to these items
3. Position them in the results list, starting at rank 1
4. Apply the KE Method to the rest of the items
5. Position them in the results list, starting at rank c+1
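The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "more than half" is read as appearing in more than m/2 of the m top-k lists, each item appears at most once per list, a lower KE weight ranks higher, and ties are broken alphabetically. The names `ke_weight` and `antispam_ke` are hypothetical.

```python
def ke_weight(ranks, m, k):
    """KE weight: sum of ranks over n^m * (k/10 + 1)^n, with n = len(ranks)."""
    n = len(ranks)
    return sum(ranks) / (n ** m * (k / 10 + 1) ** n)


def antispam_ke(top_k_lists, k):
    """Rank items seen by most engines first, then everything else."""
    m = len(top_k_lists)
    ranks = {}  # item -> list of its 1-based ranks, one per appearance
    for results in top_k_lists:
        for position, item in enumerate(results, start=1):
            ranks.setdefault(item, []).append(position)

    def by_weight(items):
        # Assumption: lower KE weight is better; ties broken by name.
        return sorted(items, key=lambda item: (ke_weight(ranks[item], m, k), item))

    # Step 1: items that appear in more than half of the lists.
    frequent = [item for item in ranks if len(ranks[item]) > m / 2]
    rest = [item for item in ranks if len(ranks[item]) <= m / 2]
    # Steps 2-3 fill ranks 1..c; steps 4-5 continue from rank c + 1.
    return by_weight(frequent) + by_weight(rest)


engine_1 = ["spam", "a", "x"]
engine_2 = ["a", "b", "x"]
print(antispam_ke([engine_1, engine_2], k=3))  # ['a', 'x', 'spam', 'b']
```

In this toy input the plain KE weights alone would place "spam" (about 0.77) ahead of "x" (about 0.89), because one engine ranks it first. The antispam variant pushes it below every item that both engines returned.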


Quad Search's Architecture

Outline: Existing Engines, Quad Search, Web Platform, Architecture, User Interface, Quad Bot, Web Search APIs, Engine Bombing, Results Filtering, Advanced Search, Search Options, Result Presentation, Extra Features

Schematic diagram of Quad Search's Architecture

[Diagram components: User, User Interface, Database Selector, Quad Bot, Object Builder, Classification Module, Presentation Module. Diagram labels: Query Terms, Ranking Algorithm, Results Page]

User Interface

Features

Quad Search's User Interface is friendly and simple in order to ensure:

• Short download times
• Compatibility with all major browsers
• Convenient usage

For this reason, we avoided using:

• Large graphics files
• JavaScript and AJAX
• Flash presentations

User Interface (Search Hints)

Search Hints

We developed this part of Quad Search to provide:

• Detailed information about all its features
• Explanations of simple and complex operations
• Many helpful examples

Quad Bot (1)

Description

Quad Bot is responsible for result retrieval. It consists of the following sub-modules:

• Input Validator: It performs security checks
• Query Dispatcher: It submits the query to the component search engines simultaneously
• Result Collector: It gathers the engines' responses
• Result Validator: It performs multiple conversions on the collected data
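The four sub-modules can be pictured as a small pipeline. The sketch below is an assumption-laden illustration, not Quad Bot's code: the component engines are hypothetical stub functions instead of real HTTP requests, the security check is reduced to a length test, and the conversions are reduced to whitespace trimming. Simultaneous dispatch is modelled with a thread pool from the Python standard library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def validate_query(query):
    """Input Validator stand-in: reject empty or oversized queries."""
    query = query.strip()
    if not query or len(query) > 256:
        raise ValueError("invalid query")
    return query


def dispatch(query, engines):
    """Query Dispatcher and Result Collector: call all engines concurrently."""
    collected = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {pool.submit(fetch, query): name for name, fetch in engines.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                collected[name] = future.result()
            except Exception:
                collected[name] = []  # a failing engine contributes nothing
    return collected


def validate_results(collected):
    """Result Validator stand-in: trim whitespace and drop empty entries."""
    return {name: [url.strip() for url in urls if url.strip()]
            for name, urls in collected.items()}


# Hypothetical stub engines; a real component engine would be queried over HTTP.
def engine_one(query):
    return [f"http://one.example/{query}", "  "]


def engine_two(query):
    return [f" http://two.example/{query}"]


def broken_engine(query):
    raise RuntimeError("engine unavailable")


engines = {"one": engine_one, "two": engine_two, "broken": broken_engine}
results = validate_results(dispatch(validate_query(" metasearch "), engines))
print(results)
```

Keeping each stage a plain function makes it easy to swap the stubs for real fetchers without touching the dispatch logic.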

Quad Bot (2 - Architecture)

Architecture

[Diagram components: DB Selector - User, Parameter Receiver - Validator, Query Dispatcher, Engine 1, Engine 2, Engine 3, Engine 4, Result Collector, Result Validator, Object Builder]

Web Search APIs

What is a Web Search API?

API stands for Application Programming Interface.

It is a programming tool supplied by the manufacturer of a large-scale application.

A Web Search API is used to retrieve results from major search engines.

Disadvantages

• Inaccurate results compared to the "mother" engine
• Queries per day limitation
• Registration IDs required
• Queries per registration ID limitation

Quad Search does not make use of Search APIs.

Engine Bombing

Definition

Engine bombing occurs when multiple results from the same domain enter the presented results list.

Many metasearch engines suffer from engine bombing.

Engine Bombing Protection

Quad Search supports a feature that limits the number of different results coming from the same domain.
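One simple way to cap results per domain is sketched below. It is an illustration only: the slides do not say how Quad Search identifies a domain or what the limit is, so the sketch treats the URL's host name as the domain (meaning `www.a.example` and `a.example` count separately) and uses a hypothetical `max_per_domain` parameter.

```python
from collections import Counter
from urllib.parse import urlsplit


def limit_per_domain(urls, max_per_domain=2):
    """Keep at most max_per_domain results per host, preserving rank order."""
    seen = Counter()
    kept = []
    for url in urls:
        # Assumption: the host name stands in for the domain.
        domain = urlsplit(url).hostname or ""
        if seen[domain] < max_per_domain:
            seen[domain] += 1
            kept.append(url)
    return kept


ranked = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
print(limit_per_domain(ranked))
# ['http://a.example/1', 'http://a.example/2', 'http://b.example/1']
```

Because the filter walks the list in rank order, the results that survive for each domain are its best-ranked ones.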

Results Filtering

Provided Filters

• Antispam Filter: Application of the antispam version of the KE Method
• Ranking Algorithm Selector: Quad Search provides an option to determine how the collected results will be ranked
• Engine Bombing Protection

Advanced Web Search

Advanced Search Filters

• File Type Selector: The user can search for files of a specific format (PDF, DOC, XLS and PPT)
• Language Filter: Quad Search can return documents written in a specified language
• Domain Filter: The user can search a given domain, or exclude a domain from a search
• Date Filter: Return results updated in the past 3, 6, or 12 months

Web Search Options

Quad Search lets the user set options that will be used in future searches.

Some of these options are:

1. Connection timeout: how long Quad Search should wait for a search engine to respond
2. The number of candidates to be collected per component engine
3. The number of results to be displayed per result page
4. Whether the results will be opened in a new browser window

Results Presentation (1)

Classic View: The results are displayed in the classic way.

Array View: The results are displayed in a ranked array. The user can view the results and their rankings more easily.

Results Presentation (2)

Results Page

The results page is highly customizable.

[Screenshot of the results page]

Scientific Search

Outline: Scientific Search, Related Work, H-Index, Search Options, Advanced Search, Cache, Extra Features

General Features

Quad Search is capable of searching for scientists, authors and/or published articles.

Google Scholar provides the required data.

Quad Search collects the data and produces statistics and charts.

H-Index

Definition

The h-index is an index for quantifying the scientific productivity of physicists and other scientists based on their publication record.

A scientist has index h if h of his N_p papers have at least h citations each, and the other (N_p - h) papers have no more than h citations each.

Quad Search computes the h-index when the user performs an author search.
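The definition translates directly into a short computation over a list of per-paper citation counts. The sketch below is a generic implementation of that definition, not Quad Search's code; the function name `h_index` and the sample counts are illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position  # the top `position` papers all have >= position citations
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
```

With the counts above, four papers have at least 4 citations each and the remaining paper has 3, so h is 4.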

Scientific Search Options

The scientific search part of Quad Search offers a variety of options that can be stored and used in future searches.

The user can define:

• The results' language
• The results' subject area (biology, chemistry, physics, engineering, medicine, etc.)
• The number of results to be displayed per page
• Whether the results will be opened in the current window or in a new one

Extra Features - Charts

The user can visually check the number of citations per paper for a specified author. This feature is available for "Author Searches".

Extra Features - Excluding Papers

When a user performs an "Author Search", Quad Search fetches all results from Google Scholar (or its cache).

Some of these articles possibly should not take part in the calculations (e.g. the h-index).

The user can exclude such papers from the calculations by deselecting the appropriate checkbox.

Future Work

Outline: Future Work, Concluding Remarks

Our plans for Quad Search

• Support for extra ranking algorithms (e.g. Markov chains)
• Geography-aware search for News
• News Search with RSS feeds
• Wide personalization (users, profiles, topics of interest, stored multimedia and user-defined customization)
• Image and Video searches
• Searches in P2P networks (eDonkey, Gnutella, etc.)
• Torrent searches

Concluding Remarks

Conclusions

• In this session, we presented a pair of rank aggregation algorithms: the KE Method and its antispam version
• We introduced some new parameters, such as the number of top-k lists in which a page appears and the total number of exploited search engines
• We also presented a novel metasearch engine, Quad Search
• Quad Search offers a wide variety of new features for web search, such as the ranking algorithm selector and engine bombing protection
• Quad Search also provides options for searching for scientific articles, and it computes statistics such as the h-index