midnight in the garden of good and evil search engines

Post on 19-Jan-2016



Midnight in the Garden of Good and Evil Search Engines

• Presentation by Richard Wiggins

– Technical Advisor, NEM Online, Michigan State University

• www.msu.edu/staff/rww

• wiggins@msu.edu

– Columnist, “Internet Buzz,” webreference.com

• www.webreference.com/outlook

• wiggins@internet.com

– Co-host, Nothing But Net television program (produced by Media One)

A Parable: The Encounter Between the USS Nimitz and a Canadian Vessel...

A Frequency Analysis of the Appearance of a Critical Search Term Among Major Search Engines...

Frequency of the Search Term “Slavko” Among Major Search Indexes

• AltaVista 5477

• Excite 1160

• Infoseek 1452

• Hotbot 4226

[Chart: hit counts for “Slavko” and “Celtic Music” on AltaVista, Excite, Infoseek, and Hotbot; vertical axis from 0 to 14000]

Come Join Our Tour of

...a place millions want to visit...

…where a cast of characters stands ready to help you find exactly what you’re looking for...

SearchVannah’s Tour Guides

• …a relatively new town

• …only existed since 1993

• With so many visitors, lots of tour guides have set up shop

– They tend to have funny names

– They compete fiercely

– They’re all trying to make money helping visitors find their way

The Tour Guides

• AltaVista

– Fast, lots of memory, knows a lot

– But people sometimes complain that results are inconsistent

• InfoSeek

– Claims answers are more relevant

• MetaCrawler

– Doesn’t know anything at all! Just asks the other tour guides!

HotBot: This tour guide wears the ugliest clothes!

The Tour Guides...

• Inktomi: other tour guides hire Inktomi to answer their questions

• One guide knows a LOT less than all the others…

– But it’s the most popular by far!

– The smarter tour guides think of it as just a dumb Yahoo…

• But maybe tourists want to know where the B&B is, not a list of all the towels and dishes

Definitions

• Crawler: automated tool to discover new and changed pages, feeds data to…

• Indexer: builds and maintains an index, concordance-style

• Search engine: the actual tool end-users employ when searching

• …but in popular usage, all together = “search engine”
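
The crawler, indexer, and search engine split defined above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any engine named in these slides is built: the `PAGES` dictionary stands in for crawler output, the URLs are made up, and `tokenize`, `build_index`, and `search` are hypothetical names chosen for this sketch.

```python
import re
from collections import defaultdict

# Stand-in for crawler output: URL -> page text (made-up sample data).
PAGES = {
    "http://example.edu/a": "Celtic music festival schedule",
    "http://example.edu/b": "Music department course catalog",
    "http://example.edu/c": "Climate change research at the library",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Search engine: return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)

INDEX = build_index(PAGES)
print(search(INDEX, "music"))
print(search(INDEX, "celtic music"))
```

The index here is a plain word-to-pages mapping, which is the "concordance-style" structure in its simplest form; it has no ranking, stemming, or phrase handling.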

Leveraging 30 Years of Information Retrieval (IR)

• Most new ideas we see in Web engines were thought of long ago...

– Stemming

– Controlled vocabulary

– Text analytics

– Knowledge Bases

– Personalization (by observing user usage patterns)

– Natural language

How Do People Search?

“Honestly, tourists are the dumbest people” -- anonymous Tour Guide

What Do People Search For?

• Major search services say people look for...

– Sex sites

– One’s own name

– Friends, colleagues’ Web sites (also by name)

– Items in the news

– Company / product information

– Etc.

Metaspy: Window into Real User Queries

One user view of search.msu.edu: Academics

• application for graduation

• overseas study

• ordering catalog

• School of Music

• Computer Science

• human ecology department

• psychology 101

Another user view of search.msu.edu: Virtual Library

• DNA sequencing

• climate change

• beam theory

• feline brain tumor

• PRL and sequencing

Another user view of search.msu.edu: Extension

• livestock pavilion

• wildlife fisheries

• bathtub removal and installation

• Round Bale Storage

Another user view of search.msu.edu: Conversational

• I would like to know if you offer a workshop on “International Law”

What Do People Search For? Matt Koll’s Formulation

• “finding a needle in a haystack”

• a known needle in a known haystack

• a known needle in an unknown haystack,

• to any needle in a haystack

• Where are the haystacks?

• GenX rendition: Needles? Haystacks? Whatever!

Typical User Search Strategy

• Type in a one-word search term

• Maybe two words

• Seldom exploit advanced options

– Capitalization

– Quoting phrases (e.g. “climate change”)

– Date restrictions

– Host:, URL: parameters

• Seldom use iterative refinement

Users Make “Wrong” Choices

• Picking the right database is confusing

– Reference librarians, experienced users learn brand names

– Inexperienced users do not

• Lycos example: “Small” versus “Large” catalog

– “Small” catalog was faster, more precise

– Virtually no one used it, thinking “Large” meant “better”

A Route 128 Story

• Engineering firm on Route 128

• Engineers new products

• Has constant need for specialized information

• Uses traditional sources, and the Web

• “Joe down the hall” does the Internet searches

• Joe is a reference librarian with an engineering degree (and no training in online searching!)

Prospects for Training are Dismal!

• We don’t know the users, so we can’t hope to train them

• Users won’t read documentation or help notes

• If engine doesn’t deliver, users react viscerally

– “This engine is useless” or

– “The Internet has nothing useful”

– “The Internet has too much information!”

How Well Do Today’s Engines Meet Real Users’ Needs?

• Most engines cannot yield high precision, high recall hit list with only one search term

• But most users don’t compose or refine their searches carefully

• Boolean operators virtually unused

• Therefore most users probably fail to get desired results

• Many sample searches from MSU example would not yield desired information

AltaVista “Intelligent” Case Matching Example

• Looking for information on “TREC” search engines testing at NIST

Scale Issues

“This town is growing so fast, and there’s too many tourists!” -- a 3rd generation resident

The Problem of Scale

• No one knows exact size of Web

– Databases, intranets complicate issue

– “Dark matter” -- Vint Cerf

• Probably 250 to 500 million pages publicly accessible

• Recent Science article claims most spider coverage is incomplete

• AltaVista claims 140 million pages in index

1 Billion URLs -- and Beyond

[Chart: number of URLs by year: 30M in 1996, 140M in 1997, 1000M in 1998]

Problem of Scale: Transaction Load

• AltaVista handles 30 million searches per day

• Inktomi is “back-end” for numerous sites

– HotBot, N2H2 (Japan), Australian news service

– Soon, the “find a Web site” function in Windows 98

• No popular service has melted down yet

Inktomi’s “Network of Workstations” Model

• Eric Brewer, CEO, claims centralized high-speed servers cannot scale

• Developed new clustering scheme: dozens or hundreds of low-cost servers on high-speed network

• But centralized engines have not broken down yet

• 64-bit processors @ 300-450 MHz, gigabytes of RAM, fast paths to disk

Trends

“We have a forward-looking sense of fashion!” -- one of the tour guides

Trends Among Search Engines

• Observations of Dr. Susan Feldman, Cornell:

• More professional look, feel than a couple years ago

• Common syntax evolving:

– Plus sign prefix for required term, minus for excluded term

– Quotes signify phrases, caps signify case significant

• Unique “personalities” evolving
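
The plus, minus, and quote conventions listed above can be illustrated with a small parser. This is a minimal sketch, not the grammar of any particular engine: `parse_query` and `matches` are hypothetical names, matching is plain substring search, and the sketch ignores the capitalization rule by comparing everything in lowercase.

```python
import re

# One token: optional +/- sign, then either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into (required, excluded, optional) lists of terms."""
    required, excluded, optional = [], [], []
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue  # skip empty quotes such as ""
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional

def matches(text, required, excluded):
    """True if text contains every required term and no excluded term."""
    lowered = text.lower()
    has_required = all(term.lower() in lowered for term in required)
    has_excluded = any(term.lower() in lowered for term in excluded)
    return has_required and not has_excluded

required, excluded, optional = parse_query('+"climate change" -politics warming')
print(required, excluded, optional)
print(matches("Climate change and global warming", required, excluded))
print(matches("The politics of climate change", required, excluded))
```

Optional terms are returned but not used by `matches`; in a real engine they would typically influence ranking rather than filtering.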

The Role of Meta-Crawlers

• Experts agree that spider coverage varies across services

• No two services cover the same sites for a given search

• Therefore searching across multiple indexes yields more results

• Therefore metacrawlers can be useful
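
The argument above, that different indexes cover different sites so combining them yields more results, can be sketched as a merge of several hit lists. This is an illustrative sketch only: the engine names and URLs in `SAMPLE` are invented, `merge_results` is a hypothetical helper, and the ranking rule (more engines first, then best position) is an assumption, not a description of how MetaCrawler works.

```python
def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one list.

    URLs returned by more engines come first; ties are broken by the best
    (lowest) position any engine gave the URL, then alphabetically.
    """
    votes = {}      # URL -> number of engines that returned it
    best_rank = {}  # URL -> best position in any engine's list
    for urls in results_by_engine.values():
        # dict.fromkeys drops repeats within one list while keeping order.
        for rank, url in enumerate(dict.fromkeys(urls)):
            votes[url] = votes.get(url, 0) + 1
            best_rank[url] = min(best_rank.get(url, rank), rank)
    return sorted(votes, key=lambda u: (-votes[u], best_rank[u], u))

# Invented hit lists for three hypothetical engines.
SAMPLE = {
    "engine_a": ["http://x.example/1", "http://x.example/2"],
    "engine_b": ["http://x.example/2", "http://x.example/3"],
    "engine_c": ["http://x.example/4"],
}

merged = merge_results(SAMPLE)
print(merged)
```

The merged list holds four distinct URLs even though no single sample engine returned more than two, which is the coverage benefit the slide describes.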

Targeted Spiders

• Train the spider to crawl only sites that fit a certain subject domain

• InfoSeek News Index

– Death of a Princess example

• Internet.com’s “vertical” index

• LawCrawler

• NEM Online

– Research project at Michigan State University

– Harnessing information of use to manufacturers

“death of Princess Diana” Search on Infoseek, 8/31/97 1:00 pm

AskJeeves: Question-oriented Knowledge Base

A Better AskJeeves Question

Northern Light

Traditional Model: First, Pick a Database, Then Do Your Search

Why Northern Light is a Breakthrough

• Delivering quality sources alongside Web resources

– As Web becomes more cluttered, advantage grows

• Database search paradigm inverted: First do your search, then pick your source

• Automatic categorization yields manageable hit lists

– Advantage also grows as Web grows

Real Name System

Specialized Engines: Serving Specific Geographic Areas

Search for “Intel” on Excite

Alexa: Group Experience

Beyond Text: Still Images, Digitized Speech, Video

• We tend to think of search engines as limited to text

• But increasingly we will face digital content

• Thanks to scanners, digital cameras, digital sound cards, digital video cameras

• These digital collections will be corporate assets

• But to use, and re-purpose, these assets, we will need search engines

IBM Almaden’s Image Search Software

• Able to index a large collection of still images

• Able to find similar images

– User selects image, asks for similar shapes

– User draws shapes

– User filters by color, textual metadata

• Samples available online:

– Searchable digital postage stamp archive

• www.qbic.almaden.ibm.com/cgi-bin/stamps-demo

– Searchable archive of trademarks (logos)

Magnifi: Multimedia Search Engine

AltaVista Keyword Index into Clinton Testimony Video

Cross-Language Searching

• Internet is biased towards English

• But it is a World Wide Web

• Tools to allow searching in one language, against a universe in other languages, are evolving

• Challenges of understanding meaning, resolving ambiguities multiply

• But effective tools are coming

The AltaVista Translation Service: Extending Search Engines into New Areas

• Translates to/from English, Spanish, French, Italian, German

• Try translating “Are you having a bad hair day?” to another language and back...

Translation Result:

• “Are you having a bad hair day?” ...becomes…

• “It is for you defective day of hats, no?”

Any Portal in a Storm

• Search engine services becoming portals

• Non-index services

– Browsing view

– Stock quotes

– Pager services

– Personalization (“My Yahoo,” “My AltaVista,” “My Foot”)

• The linear search engine result set can’t compete without added components

Evaluating the Engines

“You just can’t trust some of the other tour guides!” -- every tour guide

Evaluating Search Engines

• Searchenginewatch.com

– Part of internet.com family

– “Search Engine EKG”

– Measures rate of crawling, other metrics, for leading Web engines

• National Institute of Standards and Technology TREC Series

– Rigorous annual “bakeoff” conducted by Donna Harman

– Leading technology firms, university researchers compete

SearchEngineWatch.com: Search Engine EKG

AltaVista vs Infoseek: An Accidental Bakeoff

• Michigan State University was first university to acquire AltaVista Intranet product (1996)

• Used for campus-wide spider as well as subject-specific index (manufacturing)

– search.msu.edu

– www.nemonline.org

• Infoseek on its own initiative set up an index of msu.edu

AltaVista vs Infoseek: Preliminary Observations

• In many cases, AltaVista and Infoseek return very similar results

• Using actual searches typed by users, in some cases Infoseek shows superior relevancy ranking

– Word proximity has more weight

• Infoseek also appears to offer superior duplicate detection

• “Find similar” in Infoseek works very well

Decentralized Searching: The Infoseek Experiment

• Steve Kirsch (CEO of Infoseek) offers this experiment:

• “Name a movie by James Cameron”

What This Experiment Shows…

• Some servers are louder than others

• Several servers know recent, highly-publicized information

• Some pieces of information are known only to one server

• Some servers give out wrong information

• Some servers never answer any question

Decentralization Trend

• We’ve tried decentralized indexes with little success

– WAIS

– Harvest

• But scale of single central indexes may force new attempts

• Infoseek intends major push

– Network of “Ultraseek” intranet sites

– “Use other people’s servers to do the hard work”

Ethics of Engines

“What does ethics have to do with helping people find things??!” -- every tour guide

The Ethics of Search Engines

• Gaining value from freely-available content

• Yahoo, AskJeeves advertise themselves as reference sources

• They make money on answers that others provide for free

• Are they a bibliography, which has always been legitimate?

• Or, thanks to the hyperlink, are they exploiting those who provide the real value?

Ethics: Index Spamming

• People learned to spam the index early on

– Overload your page with terms people use in searching

– Some sites present a different page to the spider than the end user sees

– One church asked a Web developer to put in meta tags with obscene words

• Is spamming unethical?

– Seems to be, but why exactly?

– Sears catalog vs Montgomery Ward

Ethics: The Search Services’ Incentives

• Most make money from banner ads

• They want to maximize page impressions and clickthroughs

• The ideal user would search forever!

• Banner ads adapt to the search based on keyword

– Banner ad technology is better than result set technology!

Ethics: Editorial Copy Versus Advertising-Influenced

• In the print world, it’s pretty obvious what’s an advertisement

– Yellow Pages

– New York Times

– Thomas Register

• To avoid confusion, some ads are labeled

• In the online world, it’s not always clear

• If companies sold better search positions, how would we know?

Ethics: Buy First Place on the Hit List -- and Tell How Much You Paid!

Paying for Position

“We tried the editorial system of rating pages, and we found that it wasn't scalable...but the market is infinitely scalable.”

– Jeffrey Brewer, CEO, GoTo.com

The Future

“I don’t know what the future is, but we’ll be number one!!!” -- every tour guide

The Future: Promises and Limits

• IR scientists say engines may be approaching fundamental limit

• Koll: typical gigabyte of searchable space holds 25,000 occurrences of typical search term

• “With a lot of work, maybe we can get to 50% recall and 50% precision”

• But combination of approaches can yield greater power
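
Precision and recall, the two measures quoted above, have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below computes both for an invented example; the document IDs are made up and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a hit list against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: the engine returns documents 1..10, but only 1..5 of
# those are relevant, and five more relevant documents (11..15) are missed.
returned_docs = range(1, 11)
relevant_docs = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]

precision, recall = precision_recall(returned_docs, relevant_docs)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

In this example half of the hit list is noise and half of the useful material is never shown, which is what the "50% recall and 50% precision" target would look like to a user.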

The Search Engine Industry

• Analysts generally agree that “Yahoo wins!”

• Claims ~100 million transactions per day

• Also claims 30 million unique users

• Also claims more “viewership” per day than most specialty cable TV channels (e.g. MTV)

• And it’s a catalog, not a full-text engine!

Search Engine Companies’ “Value Per User” (Mecklermedia)

VALUE INDEX (sorted by value per user)

Company          Users (millions)   Market Value of Company (millions)   Value Per User
Yahoo            32.5               $ 5,273                               $ 162.38
Microsoft.com    18.0               $ 1,850                               $ 102.68
Excite           19.3               $ 1,488                               $ 76.96
Lycos – Tripod   15.1               $ 992                                 $ 65.58
Netscape.com     23.4               $ 1,500                               $ 64.09
Infoseek – WBS   16.2               $ 964                                 $ 59.50
AltaVista        7.5                $ 260                                 $ 34.75
TOTAL            $ 179              $ 14,982                              $ 730
AVERAGE          $ 18               $ 1,498                               $ 73

Changes and Alliances

• All search sites now offer browsing views

• All services offering free e-mail

• Yahoo offering news

• Alliances

– AltaVista plus Real Name System

– AltaVista plus Amazon.com

– Lycos plus Barnes and Noble online

– Yahoo drops AltaVista when AltaVista adds browsing view

Combining Best Features

• Of Yahoo, Infoseek, AskJeeves

• Build a knowledge base

– Leverage the actual queries people issue

– An FAQ

• Offer a blend of drill-down hierarchy, knowledge base, full-text

• Search for one word yields rich result set

– E.g. “Intel”

• Example: Verity’s new Knowledge Organizer

Verity’s Knowledge Organizer Product

• A tool to capture and organize an organization’s online information

– Build your own Yahoo and AltaVista-style search service

• Site builds its own topical taxonomy

– Using a graphical user interface

• Tool indexes within categories and across them

• End user can

– drill down within topics

– search within and across topics

A Modest Proposal: The Accidental Thesaurus

• For intranet, online product catalog, newspaper, campus sites

• Build a thesaurus based on what people look for

• Don’t even try to be comprehensive

• Use your search logs to find what people look for -- and how they actually search

• Fuzzy matching of user searches against thesaurus, a la AskJeeves
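
The proposal above has two mechanical parts: counting what people actually type, and fuzzily matching new queries against the resulting thesaurus. The sketch below shows one possible shape for both, using Python's standard `difflib` for the fuzzy step. Everything specific here is an assumption made for illustration: the `LOG` and `THESAURUS` contents, the example.edu URLs, the helper names, and the 0.8 similarity cutoff. It is not a description of how AskJeeves matches questions.

```python
import difflib
from collections import Counter

def normalize(query):
    """Lowercase a query and collapse runs of whitespace."""
    return " ".join(query.lower().split())

def top_queries(log_lines, n=3):
    """Count normalized queries in a search log; return the n most common."""
    counts = Counter(normalize(line) for line in log_lines if line.strip())
    return counts.most_common(n)

def lookup(query, thesaurus, cutoff=0.8):
    """Fuzzy-match a query against thesaurus entries; None if nothing is close."""
    close = difflib.get_close_matches(
        normalize(query), list(thesaurus), n=1, cutoff=cutoff
    )
    return thesaurus[close[0]] if close else None

# Invented search log, one query per line.
LOG = [
    "overseas study",
    "Overseas  Study",
    "school of music",
    "overseas study",
    "school of music",
    "psychology 101",
]

# Hand-built thesaurus: popular query -> page picked by a human editor.
THESAURUS = {
    "overseas study": "http://example.edu/study-abroad",
    "application for graduation": "http://example.edu/registrar/graduation",
}

print(top_queries(LOG, n=2))
print(lookup("overseas studies", THESAURUS))
print(lookup("bathtub removal", THESAURUS))
```

The log counts tell the editor which entries are worth adding first, and the cutoff controls how far a query may drift from an entry before the lookup gives up and falls through to ordinary full-text search.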

New Job Title: The Info Snout

• Like an Info Scout...only nosier

• Similar job as cataloging librarian...more like a pathfinder builder

• Daily routine:

– Look at search logs

– Find new terms, add to thesaurus

– Also look at company newsletters, newspaper, trade journals, etc

Lack of Structure

• Today’s spiders effectively index every page as a separate document

• What if an OPAC did that?

• The atom in a hit list should be a document, not a page

• With XML, one could define structure for documents

• But will we have one definition, or many?

The Future

• Much more intelligent engines

• Not much more intelligence in users

• The linear, undifferentiated hit list will die

• Cross-language

• Text, image, sound, video

• The “Star Trek” computer model of searching

A Comment from the PR Person at a Major Internet Search Service...

• “I hope you are aware of our product, and I hope your remarks will show that our product is one of the good ones, not one of the evil ones…”

• We will not name the company, but its name evokes “aurora borealis”….

Infonortics Search Engines Conference

• Outstanding two-day conference with leading search engine experts

– From academe and from search industry

• Held April 1 in Boston; two previous conferences

• Scheduled for April 19-20, 1999

– Back Bay Hilton, Boston

• See www.infonortics.com

Special Thanks To...

• Judy Matthews, Michigan State University Libraries

• Lou Rosenfeld, Argus Associates

• Sue Davidsen, Michigan Electronic Library

• Julie Long, Advanced Information Consultants

See Related Articles in June 1998 issue of Searcher

• “Infonortics '98 Search Engines Conference” article by Judy Matthews and Rich Wiggins: http://www.infotoday.com/searcher/jun/story4.htm

• Article & chart covering search engine trends by Susan Feldman: http://www.infotoday.com/searcher/jun/story2.htm

These slides will appear...

• www.nemonline.org/present/rww
