midnight in the garden of good and evil search engines

Midnight in the Garden of Good and Evil Search Engines Presentation by Richard Wiggins Technical Advisor, NEM Online, Michigan State University www.msu.edu/staff/rww [email protected] Columnist, “Internet Buzz,” webreference.com www.webreference.com/outlook [email protected] Co-host, Nothing But Net television program (produced by Media One)

Post on 19-Jan-2016


Page 1: Midnight in the Garden  of Good and Evil  Search Engines

Midnight in the Garden of Good and Evil Search Engines

• Presentation by Richard Wiggins

– Technical Advisor, NEM Online, Michigan State University

• www.msu.edu/staff/rww

[email protected]

– Columnist, “Internet Buzz,” webreference.com

• www.webreference.com/outlook

[email protected]

– Co-host, Nothing But Net television program (produced by Media One)

Page 2: Midnight in the Garden  of Good and Evil  Search Engines

A Parable: The Encounter Between the USS Nimitz and a Canadian Vessel...

Page 3: Midnight in the Garden  of Good and Evil  Search Engines
Page 4: Midnight in the Garden  of Good and Evil  Search Engines

A Frequency Analysis of the Appearance of a Critical Search Term Among Major Search Engines...

Page 5: Midnight in the Garden  of Good and Evil  Search Engines

Frequency of the Search Term “Slavko” Among Major Search Indexes

• AltaVista 5477

• Excite 1160

• Infoseek 1452

• Hotbot 4226

Page 6: Midnight in the Garden  of Good and Evil  Search Engines

[Chart: frequency of “Slavko” and “Celtic Music” on AltaVista, Excite, Infoseek, and Hotbot; vertical axis from 0 to 14000]

Page 7: Midnight in the Garden  of Good and Evil  Search Engines

Come Join Our Tour of

...a place millions want to visit...

…where a cast of characters stands ready to help you find exactly what you’re looking for...

Page 8: Midnight in the Garden  of Good and Evil  Search Engines

SearchVannah’s Tour Guides

• …a relatively new town

• …only existed since 1993

• With so many visitors, lots of tour guides have set up shop

– They tend to have funny names

– They compete fiercely

– They’re all trying to make money helping visitors find their way

Page 9: Midnight in the Garden  of Good and Evil  Search Engines

The Tour Guides

• AltaVista

– Fast, lots of memory, knows a lot

– But people complain sometimes results are inconsistent

• InfoSeek

– Claims answers are more relevant

• MetaCrawler

– Doesn’t know anything at all! Just asks the other tour guides!

Page 10: Midnight in the Garden  of Good and Evil  Search Engines

HotBot: This tour guide wears the ugliest clothes!

Page 11: Midnight in the Garden  of Good and Evil  Search Engines

The Tour Guides...

• Inktomi: other tour guides hire Inktomi to answer their questions

• One guide knows a LOT less than all the others…

– But it’s the most popular by far!

– The smarter tour guides think of it as just a dumb Yahoo…

• But maybe tourists want to know where the B&B is, not a list of all the towels and dishes

Page 12: Midnight in the Garden  of Good and Evil  Search Engines

Definitions

• Crawler: automated tool to discover new and changed pages, feeds data to…

• Indexer: builds and maintains an index, concordance-style

• Search engine: the actual tool end-users employ when searching

• …but in popular usage, all together = “search engine”
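
The three roles defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any engine named in these slides works: the pages, the URLs, and the function names (`tokenize`, `build_index`, `search`) are all hypothetical, and the crawler is replaced by a hard-coded dictionary of already-fetched pages.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for pages a crawler has already fetched (URL -> text).
CRAWLED_PAGES = {
    "http://example.edu/music": "School of Music concert schedule",
    "http://example.edu/cs": "Computer Science course catalog",
    "http://example.edu/catalog": "Ordering the course catalog",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Search engine: return the URLs containing every query word, sorted."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)

INDEX = build_index(CRAWLED_PAGES)
```

Calling `search(INDEX, "course catalog")` returns only the pages that contain both words, which is the concordance-style lookup the Indexer bullet describes.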

Page 13: Midnight in the Garden  of Good and Evil  Search Engines

Leveraging 30 Years of Information Retrieval (IR)

• Most new ideas we see in Web engines were thought of long ago...

– Stemming

– Controlled vocabulary

– Text analytics

– Knowledge Bases

– Personalization (by observing user usage patterns)

– Natural language

Page 14: Midnight in the Garden  of Good and Evil  Search Engines

How Do People Search?

“Honestly, tourists are the dumbest people” -- anonymous Tour Guide

Page 15: Midnight in the Garden  of Good and Evil  Search Engines

What Do People Search For?

• Major search services say people look for...

– Sex sites

– One’s own name

– Friends, colleagues’ Web sites (also by name)

– Items in the news

– Company / product information

– Etc.

Page 16: Midnight in the Garden  of Good and Evil  Search Engines

Metaspy: Window into Real User Queries

Page 17: Midnight in the Garden  of Good and Evil  Search Engines
Page 18: Midnight in the Garden  of Good and Evil  Search Engines

One user view of search.msu.edu: Academics

• application for graduation

• overseas study

• ordering catalog

• School of Music

• Computer Science

• human ecology department

• psychology 101

Page 19: Midnight in the Garden  of Good and Evil  Search Engines

Another user view of search.msu.edu: Virtual Library

• DNA sequencing

• climate change

• beam theory

• feline brain tumor

• PRL and sequencing

Page 20: Midnight in the Garden  of Good and Evil  Search Engines

Another user view of search.msu.edu: Extension

• livestock pavilion

• wildlife fisheries

• bathtub removal and installation

• Round Bale Storage

Page 21: Midnight in the Garden  of Good and Evil  Search Engines

Another user view of search.msu.edu: Conversational

• I would like to know if you offer a workshop on “International Law”

Page 22: Midnight in the Garden  of Good and Evil  Search Engines

What Do People Search For? Matt Koll’s Formulation

• “finding a needle in a haystack”

• a known needle in a known haystack

• a known needle in an unknown haystack,

• to any needle in a haystack

• Where are the haystacks?

• GenX rendition: Needles? Haystacks? Whatever!

Page 23: Midnight in the Garden  of Good and Evil  Search Engines

Typical User Search Strategy

• Type in a one-word search term

• Maybe two words

• Seldom exploit advanced options

– Capitalization

– Quoting phrases (e.g. “climate change”)

– Date restrictions

– Host:, URL: parameters

• Seldom use iterative refinement

Page 24: Midnight in the Garden  of Good and Evil  Search Engines

Users Make “Wrong” Choices

• Picking the right database is confusing

– Reference librarians, experienced users learn brand names

– Inexperienced users do not

• Lycos example: “Small” versus “Large” catalog

– “Small” catalog was faster, more precise

– Virtually no one used it, thinking “Large” meant “better”

Page 25: Midnight in the Garden  of Good and Evil  Search Engines

A Route 128 Story

• Engineering firm on Route 128

• Engineers new products

• Has constant need for specialized information

• Uses traditional sources, and the Web

• “Joe down the hall” does the Internet searches

• Joe is a reference librarian with an engineering degree (and no training in online searching!)

Page 26: Midnight in the Garden  of Good and Evil  Search Engines

Prospects for Training are Dismal!

• We don’t know the users, so we can’t hope to train them

• Users won’t read documentation or help notes

• If engine doesn’t deliver, users react viscerally

– “This engine is useless” or

– “The Internet has nothing useful”

– “The Internet has too much information!”

Page 27: Midnight in the Garden  of Good and Evil  Search Engines

How Well Do Today’s Engines Meet Real Users’ Needs?

• Most engines cannot yield high precision, high recall hit list with only one search term

• But most users don’t compose or refine their searches carefully

• Boolean operators virtually unused

• Therefore most users probably fail to get desired results

• Many sample searches from MSU example would not yield desired information

Page 28: Midnight in the Garden  of Good and Evil  Search Engines

AltaVista “Intelligent” Case Matching Example

• Looking for information on “TREC” search engines testing at NIST

Page 29: Midnight in the Garden  of Good and Evil  Search Engines

Scale Issues

“This town is growing so fast, and there’s too many tourists!” -- a 3rd generation resident

Page 30: Midnight in the Garden  of Good and Evil  Search Engines

The Problem of Scale

• No one knows exact size of Web

– Databases, intranets complicate issue

– “Dark matter” -- Vint Cerf

• Probably 250 to 500 million pages publicly accessible

• Recent Science article claims most spider coverage is incomplete

• AltaVista claims 140 million pages in index

Page 31: Midnight in the Garden  of Good and Evil  Search Engines

1 Billion URLs -- and Beyond

[Chart: years 1996, 1997, 1998; data labels 30M, 140M, 1000M]

Page 32: Midnight in the Garden  of Good and Evil  Search Engines

Problem of Scale: Transaction Load

• AltaVista handles 30 million searches per day

• Inktomi is “back-end” for numerous sites

– HotBot, N2H2 (Japan), Australian news service

– Soon, the “find a Web site” function in Windows 98

• No popular service has melted down yet

Page 33: Midnight in the Garden  of Good and Evil  Search Engines

Inktomi’s “Network of Workstations” Model

• Eric Brewer, CEO, claims centralized high-speed servers cannot scale

• Developed new clustering scheme: dozens or hundreds of low-cost servers on high-speed network

• But centralized engines have not broken down yet

• 64-bit processors @ 300-450 MHz, gigabytes of RAM, fast paths to disk

Page 34: Midnight in the Garden  of Good and Evil  Search Engines

Trends

“We have a forward-looking sense of fashion!” -- one of the tour guides

Page 35: Midnight in the Garden  of Good and Evil  Search Engines

Trends Among Search Engines

• Observations of Dr. Susan Feldman, Cornell:

• More professional look, feel than a couple years ago

• Common syntax evolving:

– Plus sign prefix for required term, minus for excluded term

– Quotes signify phrases, caps signify case significant

• Unique “personalities” evolving
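
The common syntax described above (plus for required, minus for excluded, quotes for phrases) can be sketched as a small parser. This is an assumption-laden illustration rather than any engine's actual grammar: `parse_query` and `matches` are hypothetical names, only straight double quotes are recognized, and matching ignores case, so the "caps signify case significant" rule is not modeled.

```python
import re

def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms.

    A quoted phrase is kept together as a single term.
    """
    parsed = {"required": [], "excluded": [], "optional": []}
    # Each token: an optional +/- prefix, then a "quoted phrase" or a bare word.
    pattern = r'([+-]?)(?:"([^"]*)"|(\S+))'
    for prefix, phrase, word in re.findall(pattern, query):
        term = phrase if phrase else word
        if not term:
            continue
        if prefix == "+":
            parsed["required"].append(term)
        elif prefix == "-":
            parsed["excluded"].append(term)
        else:
            parsed["optional"].append(term)
    return parsed

def matches(parsed, text):
    """True if text contains every required term and no excluded term.

    Uses a case-insensitive substring test (a simplifying assumption).
    """
    lowered = text.lower()
    has_required = all(t.lower() in lowered for t in parsed["required"])
    has_excluded = any(t.lower() in lowered for t in parsed["excluded"])
    return has_required and not has_excluded
```

For example, `parse_query('+celtic -irish "folk music" harp')` puts `celtic` in the required list, `irish` in the excluded list, and keeps `folk music` together as one optional phrase.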

Page 36: Midnight in the Garden  of Good and Evil  Search Engines

The Role of Meta-Crawlers

• Experts agree that spider coverage varies across services

• No two services cover the same sites for a given search

• Therefore searching across multiple indexes yields more results

• Therefore metacrawlers can be useful
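
The argument above (coverage differs, so combining indexes yields more) can be sketched as a merge step. This is a hedged illustration, not MetaCrawler's method: the engine names and URLs are made up, the hit lists are hard-coded instead of fetched, and ranking by how many engines agree is just one plausible choice.

```python
from collections import defaultdict

def merge_results(results_by_engine):
    """Combine per-engine hit lists into one list without duplicates.

    URLs returned by more engines come first; ties are broken by the best
    (lowest) rank any engine gave the URL, then alphabetically.
    """
    engines_for = defaultdict(set)
    best_rank = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls):
            engines_for[url].add(engine)
            best_rank[url] = min(rank, best_rank.get(url, rank))
    return sorted(
        engines_for,
        key=lambda url: (-len(engines_for[url]), best_rank[url], url),
    )

# Hypothetical hit lists standing in for live responses from two engines.
SAMPLE = {
    "engine_a": ["http://a.example/1", "http://b.example/2"],
    "engine_b": ["http://b.example/2", "http://c.example/3"],
}
```

With `SAMPLE`, the merged list has three distinct URLs where each engine alone returned two, and the URL both engines found is listed first.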

Page 37: Midnight in the Garden  of Good and Evil  Search Engines

Targeted Spiders

• Train the spider to crawl only sites that fit a certain subject domain

• InfoSeek News Index

– Death of a Princess example

• Internet.com’s “vertical” index

• LawCrawler

• NEM Online

– Research project at Michigan State University

– Harnessing information of use to manufacturers

Page 38: Midnight in the Garden  of Good and Evil  Search Engines

“death of Princess Diana” Search on Infoseek, 8/31/97 1:00 pm

Page 39: Midnight in the Garden  of Good and Evil  Search Engines

AskJeeves: Question-oriented Knowledge Base

Page 40: Midnight in the Garden  of Good and Evil  Search Engines

A Better AskJeeves Question

Page 41: Midnight in the Garden  of Good and Evil  Search Engines

Northern Light

Page 42: Midnight in the Garden  of Good and Evil  Search Engines

Traditional Model: First, Pick a Database, Then Do Your Search

Page 43: Midnight in the Garden  of Good and Evil  Search Engines

Why Northern Light is a Breakthrough

• Delivering quality sources alongside Web resources

– As Web becomes more cluttered, advantage grows

• Database search paradigm inverted: First do your search, then pick your source

• Automatic categorization yields manageable hit lists

– Advantage also grows as Web grows

Page 44: Midnight in the Garden  of Good and Evil  Search Engines

Real Name System

Page 45: Midnight in the Garden  of Good and Evil  Search Engines

Specialized Engines: Serving Specific Geographic Areas

Page 46: Midnight in the Garden  of Good and Evil  Search Engines

Search for “Intel” on Excite

Page 47: Midnight in the Garden  of Good and Evil  Search Engines

Alexa: Group Experience

Page 48: Midnight in the Garden  of Good and Evil  Search Engines

Beyond Text: Still Images, Digitized Speech, Video

• We tend to think of search engines as limited to text

• But increasingly we will face digital content

• Thanks to scanners, digital cameras, digital sound cards, digital video cameras

• These digital collections will be corporate assets

• But to use, and re-purpose, these assets, we will need search engines

Page 49: Midnight in the Garden  of Good and Evil  Search Engines

IBM Almaden’s Image Search Software

• Able to index a large collection of still images

• Able to find similar images

– User selects image, asks for similar shapes

– User draws shapes

– User filters by color, textual metadata

• Samples available online:

– Searchable digital postage stamp archive

• www.qbic.almaden.ibm.com/cgi-bin/stamps-demo

– Searchable archive of trademarks (logos)

Page 50: Midnight in the Garden  of Good and Evil  Search Engines
Page 51: Midnight in the Garden  of Good and Evil  Search Engines

Magnifi: Multimedia Search Engine

Page 52: Midnight in the Garden  of Good and Evil  Search Engines

AltaVista Keyword Index into Clinton Testimony Video

Page 53: Midnight in the Garden  of Good and Evil  Search Engines

Cross-Language Searching

• Internet is biased towards English

• But it is a World Wide Web

• Tools to allow searching in one language, against a universe in other languages, are evolving

• Challenges of understanding meaning, resolving ambiguities multiply

• But effective tools are coming

Page 54: Midnight in the Garden  of Good and Evil  Search Engines

The AltaVista Translation Service: Extending Search Engines into New Areas

• Translates to/from English, Spanish, French, Italian, German

• Try translating “Are you having a bad hair day?” to another language and back...

Page 55: Midnight in the Garden  of Good and Evil  Search Engines

Translation Result:

• “Are you having a bad hair day?” ...becomes…

• “It is for you defective day of hats, no?”

Page 56: Midnight in the Garden  of Good and Evil  Search Engines

Any Portal in a Storm

• Search engine services becoming portals

• Non-index services

– Browsing view

– Stock quotes

– Pager services

– Personalization (“My Yahoo, My AltaVista, My Foot”)

• The linear search engine result set can’t compete without added components

Page 57: Midnight in the Garden  of Good and Evil  Search Engines

Evaluating the Engines

“You just can’t trust some of the other tour guides!” -- every tour guide

Page 58: Midnight in the Garden  of Good and Evil  Search Engines

Evaluating Search Engines

• Searchenginewatch.com

– Part of internet.com family

– “Search Engine EKG”

– Measures rate of crawling, other metrics, for leading Web engines

• National Institute of Standards and Technology TREC Series

– Rigorous annual “bakeoff” conducted by Donna Harman

– Leading technology firms, university researchers compete

Page 59: Midnight in the Garden  of Good and Evil  Search Engines

SearchEngineWatch.com: Search Engine EKG

Page 60: Midnight in the Garden  of Good and Evil  Search Engines

AltaVista vs Infoseek: An Accidental Bakeoff

• Michigan State University was first university to acquire AltaVista Intranet product (1996)

• Used for campus-wide spider as well as subject-specific index (manufacturing)– search.msu.edu– www.nemonline.org

• Infoseek on its own initiative set up an index of msu.edu

Page 61: Midnight in the Garden  of Good and Evil  Search Engines

AltaVista vs Infoseek: Preliminary Observations

• In many cases, AltaVista and Infoseek return very similar results

• Using actual searches typed by users, in some cases Infoseek shows superior relevancy ranking

– Word proximity has more weight

• Infoseek also appears to offer superior duplicate detection

• “Find similar” in Infoseek works very well
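
To make the word-proximity observation concrete, here is a minimal sketch of ranking by how close two query words sit in a text. It is not Infoseek's algorithm, which these slides do not describe: the function names are hypothetical, and the scoring (smallest gap in word positions wins) is an assumption chosen for simplicity.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap, in word positions, between occurrences of the two terms.

    Returns None if either term is absent from the word list.
    """
    last_a = last_b = None
    best = None
    for pos, word in enumerate(words):
        if word == term_a:
            last_a = pos
        if word == term_b:
            last_b = pos
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

def rank_by_proximity(docs, term_a, term_b):
    """Order document names so texts with the two terms closest come first.

    Documents missing either term are left out.
    """
    scored = []
    for name, text in docs.items():
        gap = min_distance(text.lower().split(), term_a, term_b)
        if gap is not None:
            scored.append((gap, name))
    return [name for gap, name in sorted(scored)]
```

Under this scoring, a page containing the phrase "climate change" outranks a page where "climate" and "change" appear several words apart.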

Page 62: Midnight in the Garden  of Good and Evil  Search Engines

Decentralized Searching: The Infoseek Experiment

• Steve Kirsch (CEO of Infoseek) offers this experiment:

• “Name a movie by James Cameron”

Page 63: Midnight in the Garden  of Good and Evil  Search Engines

What This Experiment Shows…

• Some servers are louder than others

• Several servers know recent, highly-publicized information

• Some pieces of information are known only to one server

• Some servers give out wrong information

• Some servers never answer any question

Page 64: Midnight in the Garden  of Good and Evil  Search Engines

Decentralization Trend

• We’ve tried decentralized indexes with little success

– WAIS

– Harvest

• But scale of single central indexes may force new attempts

• Infoseek intends major push

– Network of “Ultraseek” intranet sites

– “Use other people’s servers to do the hard work”

Page 65: Midnight in the Garden  of Good and Evil  Search Engines

Ethics of Engines

“What does ethics have to do with helping people find things??!” -- every tour guide

Page 66: Midnight in the Garden  of Good and Evil  Search Engines

The Ethics of Search Engines

• Gaining value from freely-available content

• Yahoo, AskJeeves advertise themselves as reference sources

• They make money on answers that others provide for free

• Are they a bibliography, which has always been legitimate?

• Or, thanks to the hyperlink, are they exploiting those who provide the real value?

Page 67: Midnight in the Garden  of Good and Evil  Search Engines

Ethics: Index Spamming

• People learned to spam the index early on

– Overload your page with terms people use in searching

– Some sites present a different page to the spider than the end user sees

– One church asked a Web developer to put in meta tags with obscene words

• Is spamming unethical?

– Seems to be, but why exactly?

– Sears catalog vs Montgomery Ward

Page 68: Midnight in the Garden  of Good and Evil  Search Engines

Ethics: The Search Services’ Incentives

• Most make money from banner ads

• They want to maximize page impressions and clickthroughs

• The ideal user would search forever!

• Banner ads adapt to the search based on keyword

– Banner ad technology is better than result set technology!

Page 69: Midnight in the Garden  of Good and Evil  Search Engines

Ethics: Editorial Copy Versus Advertising-Influenced

• In the print world, it’s pretty obvious what’s an advertisement

– Yellow Pages

– New York Times

– Thomas Register

• To avoid confusion, some ads are labeled

• In the online world, it’s not always clear

• If companies sold better search positions, how would we know?

Page 70: Midnight in the Garden  of Good and Evil  Search Engines

Ethics: Buy First Place on the Hit List -- and Tell How Much You Paid!

Page 71: Midnight in the Garden  of Good and Evil  Search Engines

Paying for Position

“We tried the editorial system of rating pages, and we found that it wasn't scalable...but the market is infinitely scalable.”

– Jeffrey Brewer, CEO, GoTo.com

Page 72: Midnight in the Garden  of Good and Evil  Search Engines

The Future

“I don’t know what the future is, but we’ll be number one!!!” -- every tour guide

Page 73: Midnight in the Garden  of Good and Evil  Search Engines

The Future: Promises and Limits

• IR scientists say engines may be approaching fundamental limit

• Koll: typical gigabyte of searchable space holds 25,000 occurrences of typical search term

• “With a lot of work, maybe we can get to 50% recall and 50% precision”

• But combination of approaches can yield greater power
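
For readers less familiar with the two measures quoted above, precision is the share of returned hits that are relevant, and recall is the share of relevant documents that were returned. The sketch below computes both; the function name and the document IDs in the usage note are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a hit list against the known relevant set.

    precision = relevant hits / all hits returned
    recall    = relevant hits / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If an engine returns four documents, two of which are among the four truly relevant ones, `precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])` gives 50% precision and 50% recall, the level the quote above treats as an ambitious target.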

Page 74: Midnight in the Garden  of Good and Evil  Search Engines

The Search Engine Industry

• Analysts generally agree that “Yahoo wins!”

• Claims ~100 million transactions per day

• Also claims 30 million unique users

• Also claims more “viewership” per day than most specialty cable TV channels (e.g. MTV)

• And it’s a catalog, not a full-text engine!

Page 75: Midnight in the Garden  of Good and Evil  Search Engines

Search Engine Companies’ “Value Per User” (Mecklermedia)

VALUE INDEX                   Users        Market Value of      Value Per
(sorted by value per user)    (millions)   Company (millions)   User

Yahoo                         32.5         $ 5,273              $ 162.38
Microsoft.com                 18.0         $ 1,850              $ 102.68
Excite                        19.3         $ 1,488              $ 76.96
Lycos – Tripod                15.1         $ 992                $ 65.58
Netscape.com                  23.4         $ 1,500              $ 64.09
Infoseek – WBS                16.2         $ 964                $ 59.50
AltaVista                     7.5          $ 260                $ 34.75
TOTAL                         $ 179        $ 14,982             $ 730
AVERAGE                       $ 18         $ 1,498              $ 73
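
The "Value Per User" column appears to be market value divided by users. The sketch below recomputes it from the seven rows listed in the table, on that assumption. Because the table's inputs are rounded, the recomputed figures differ slightly from the printed ones (for Yahoo, roughly 162.25 against the printed 162.38). The function names are hypothetical.

```python
# Figures copied from the table: (users in millions, market value in $ millions).
COMPANIES = {
    "Yahoo": (32.5, 5273),
    "Microsoft.com": (18.0, 1850),
    "Excite": (19.3, 1488),
    "Lycos-Tripod": (15.1, 992),
    "Netscape.com": (23.4, 1500),
    "Infoseek-WBS": (16.2, 964),
    "AltaVista": (7.5, 260),
}

def value_per_user(users_millions, market_value_millions):
    """Dollars of market value per user; the 'millions' units cancel out."""
    return market_value_millions / users_millions

def ranked_by_value_per_user(companies):
    """Return (name, value per user) pairs, highest value per user first."""
    pairs = [
        (name, value_per_user(users, value))
        for name, (users, value) in companies.items()
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Sorting the recomputed values reproduces the order in which the table lists the companies.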

Page 76: Midnight in the Garden  of Good and Evil  Search Engines

Changes and Alliances

• All search sites now offer browsing views

• All services offering free e-mail

• Yahoo offering news

• Alliances

– AltaVista plus Real Name System

– AltaVista plus Amazon.com

– Lycos plus Barnes and Noble online

– Yahoo drops AltaVista when AltaVista adds browsing view

Page 77: Midnight in the Garden  of Good and Evil  Search Engines

Combining Best Features

• Of Yahoo, Infoseek, AskJeeves

• Build a knowledge base

– Leverage the actual queries people issue

– An FAQ

• Offer a blend of drill-down hierarchy, knowledge base, full-text

• Search for one word yields rich result set

– E.g. “Intel”

• Example: Verity’s new Knowledge Organizer

Page 78: Midnight in the Garden  of Good and Evil  Search Engines

Verity’s Knowledge Organizer Product

• A tool to capture and organize an organization’s online information

– Build your own Yahoo and AltaVista-style search service

• Site builds its own topical taxonomy

– Using a graphical user interface

• Tool indexes within categories and across them

• End user can

– drill down within topics

– search within and across topics

Page 79: Midnight in the Garden  of Good and Evil  Search Engines

A Modest Proposal: The Accidental Thesaurus

• For intranet, online product catalog, newspaper, campus sites

• Build a thesaurus based on what people look for

• Don’t even try to be comprehensive

• Use your search logs to find what people look for -- and how they actually search

• Fuzzy matching of user searches against thesaurus, a la AskJeeves
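
The fuzzy-matching step in this proposal could look something like the sketch below, which uses `difflib.get_close_matches` from the Python standard library. It is an illustration under assumptions, not how AskJeeves works: the thesaurus entries, the function names, and the 0.6 similarity cutoff are all made up for the example.

```python
import difflib

# Hypothetical thesaurus: preferred term -> variants seen in search logs.
THESAURUS = {
    "application for graduation": ["graduation application", "apply to graduate"],
    "overseas study": ["study abroad", "overseas studies"],
    "school of music": ["music school", "music department"],
}

def build_lookup(thesaurus):
    """Map every variant, and each preferred term itself, to its preferred term."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred
    return lookup

def suggest(query, lookup, cutoff=0.6):
    """Return the preferred term whose known phrasing best resembles the query.

    Returns None when nothing is at least `cutoff` similar.
    """
    cleaned = query.lower().strip()
    close = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None

LOOKUP = build_lookup(THESAURUS)
```

A misspelled search such as `suggest("study abraod", LOOKUP)` still resolves to the `overseas study` entry, so the thesaurus only has to cover what people actually type, not every possible term.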

Page 80: Midnight in the Garden  of Good and Evil  Search Engines

New Job Title: The Info Snout

• Like an Info Scout...only nosier

• Similar job as cataloging librarian...more like a pathfinder builder

• Daily routine:

– Look at search logs

– Find new terms, add to thesaurus

– Also look at company newsletters, newspaper, trade journals, etc
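
The first two steps of that daily routine can be partly automated. The sketch below counts logged queries and lists the frequent ones that the thesaurus does not yet cover. The function name, the one-query-per-entry log format, and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter

def new_terms(log_queries, known_terms, min_count=2):
    """Return (query, count) pairs for frequent logged queries not yet known.

    Queries are compared case-insensitively; blank entries are ignored.
    Results are ordered from most to least frequent.
    """
    known = {term.lower() for term in known_terms}
    counts = Counter(q.strip().lower() for q in log_queries if q.strip())
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and query not in known
    ]
```

Given a log in which "livestock pavilion" appears three times and "round bale storage" twice, both would be flagged as candidates, while a one-off query or a term already in the thesaurus would not.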

Page 81: Midnight in the Garden  of Good and Evil  Search Engines

Lack of Structure

• Today’s spiders effectively index every page as a separate document

• What if an OPAC did that?

• The atom in a hit list should be a document, not a page

• With XML, one could define structure for documents

• But will we have one definition, or many?

Page 82: Midnight in the Garden  of Good and Evil  Search Engines

The Future

• Much more intelligent engines

• Not much more intelligence in users

• The linear, undifferentiated hit list will die

• Cross-language

• Text, image, sound, video

• The “Star Trek” computer model of searching

Page 83: Midnight in the Garden  of Good and Evil  Search Engines

A Comment from the PR Person at a Major Internet Search Service...

• “I hope you are aware of our product, and I hope your remarks will show that our product is one of the good ones, not one of the evil ones…”

• We will not name the company, but its name evokes “aurora borealis”….

Page 84: Midnight in the Garden  of Good and Evil  Search Engines

Infonortics Search Engines Conference

• Outstanding two-day conference with leading search engine experts

– From academe and from search industry

• Held April 1 in Boston; two previous conferences

• Scheduled for April 19-20, 1999

– Back Bay Hilton, Boston

• See www.infonortics.com

Page 85: Midnight in the Garden  of Good and Evil  Search Engines

Special Thanks To...

• Judy Matthews, Michigan State University Libraries

• Lou Rosenfeld, Argus Associates

• Sue Davidsen, Michigan Electronic Library

• Julie Long, Advanced Information Consultants

Page 86: Midnight in the Garden  of Good and Evil  Search Engines

See Related Articles in June 1998 issue of Searcher

• “Infonortics '98 Search Engines Conference” article by Judy Matthews and Rich Wiggins: http://www.infotoday.com/searcher/jun/story4.htm

• Article & chart covering search engine trends by Susan Feldman: http://www.infotoday.com/searcher/jun/story2.htm

Page 87: Midnight in the Garden  of Good and Evil  Search Engines

These slides will appear...

• www.nemonline.org/present/rww