How to Cha-Cha: Organizing Intranet Search Results
Marti Hearst and Mike Chen, UC Berkeley
Stanford Digital Libraries Seminar, June 2, 1999


Page 1: How to Cha-Cha

Organizing Intranet Search Results

Marti Hearst and Mike Chen

UC Berkeley

Stanford Digital Libraries Seminar

June 2, 1999

Page 2: Outline

Main Idea, Motivation
System Implementation
User Interface Assessment
Related and Future Work

Page 3: People

Principals: Mike Chen and Marti Hearst
Early coding: Jason Hong
Early UI evaluation: Jimmy Lin, Mike Chen
Informal UI evaluation: Shiang-Ling Chen

Page 4: Intranet Search

Documents used in a large, diverse Intranet, e.g.,
• University.edu
• Corporation.com
• Government.gov

Hypothesis: It is meaningful to group search results according to organizational structure

Page 5: Searching Earthquakes at UCB: Standard Way

[Figure]

Page 6: Searching Earthquakes at UCB with Cha-Cha

[Figure]

Page 7: Cha-Cha Goals

Better Intranet search
– integrate searching and browsing
– provide context for search results
– familiarize users with the site structure

UI
– minimal browser requirement
• widely usable HTML interface
– build on user familiarity with existing systems

Page 8: Cha-Cha and Source Selection

Shows available sources
• Sources are major web sites
• User may want to navigate the source rather than go directly to the search hits
• Gives hints about relative importance of various sources

Reveals the structure of the site while tightly integrating this structure with search
• Users tell us anecdotally that the outline view is useful for finding starting points

Page 9: System Overview

Collect shortest paths for each page.
– Global paths: from root of the domain
– Local paths: from root of the server
– Select “the best” path based on the query

User interaction with the system:
1. The user sends a query to Cha-Cha
2. Cha-Cha passes the query to Cheshire
3. Cheshire returns hits to Cha-Cha
4. Cha-Cha selects paths and generates HTML
5. Cha-Cha returns the HTML to the user

Page 10: Current Status

Over 200,000 pages indexed
About 2500 queries/weekday
Less than 3 sec/query on average
Five subdomains using it as site search engine
– eecs
• millennium project
– sims
– law
– career center

Page 11: Outline

Main Idea, Motivation
System Implementation
User Interface Assessment
Related and Future Work

Page 12: Cha-Cha Preprocessing

Page 13: Overview of Cha-Cha Preprocessing

Crawl entire Intranet
– Store copies of pages locally
– 200,000 pages on the UCB Intranet

Revisit all the pages again (on disk)
– Create metadata for each page
– Compute the shortest hyperlink path from a certain root page to every web page
• both global and local paths

Index all the pages
– Using Cheshire II (Ray Larson, SIMS)
– Index full text, titles, shortest paths separately

Page 14: Web Crawling Algorithm

Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu

Restrict crawl to certain domain(s)
– *.berkeley.edu

Obey No Robots standard

Follow hyperlinks only
– do not read local filesystems
• links are placed on a queue
• traversal is breadth-first

Page 15: Web Crawling Algorithm (cont.)

Interpret the HTML on each web page

Record the text of the page in a file on disk.
– Make a list of all the pages that this page links to (outlinks)
– Follow those links one at a time, repeating this procedure for each page found, until no unexplored pages are left.
• links are placed on a queue
• traversal is breadth-first
• URLs that have been crawled are stored in a hash table in memory, to avoid repeats
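
The crawl described on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not the Cha-Cha crawler itself: it walks a hypothetical in-memory link table (`LINKS`) instead of fetching pages, and the names `in_domain` and `crawl` are assumptions made for the example. Robots handling and HTML parsing are left out.

```python
from collections import deque

# Hypothetical link table standing in for the live web: URL -> outlinks.
LINKS = {
    "www.berkeley.edu": ["www.sims.berkeley.edu", "www.eecs.berkeley.edu", "www.stanford.edu"],
    "www.sims.berkeley.edu": ["www.sims.berkeley.edu/people", "www.berkeley.edu"],
    "www.eecs.berkeley.edu": ["www.sims.berkeley.edu/people"],
    "www.sims.berkeley.edu/people": [],
}

def in_domain(url, suffix="berkeley.edu"):
    """Restrict the crawl to hosts under the given domain suffix."""
    host = url.split("/", 1)[0]
    return host == suffix or host.endswith("." + suffix)

def crawl(start_urls):
    """Breadth-first traversal; returns URLs in the order they were crawled."""
    queue = deque(start_urls)
    seen = set(start_urls)      # hash table of URLs already queued, to avoid repeats
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)       # a real crawler would save the page text to disk here
        for out in LINKS.get(url, []):
            if in_domain(out) and out not in seen:
                seen.add(out)
                queue.append(out)
    return order

crawled = crawl(["www.berkeley.edu"])
```

Because links go onto the back of a queue, pages are visited level by level from the start page, and the out-of-domain link to www.stanford.edu is never followed.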

Page 16: Custom Web Crawler

Special considerations
– full coverage
• web search engines don’t go very deep
• web search engines skip problematic sites
– search on “Berdahl” at snap: 430 hits
– search on “Berdahl” on Cha-Cha: XXX hits
– solution
• tag each URL with a retry counter
• if server is down, put URL at the end of the queue and decrement the retry counter
• if the counter is 0, give up on the URL
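
The retry scheme above might look roughly like the following Python sketch. The `fetch` callback, the default of three retries, and the fake fetcher used in the usage lines are all assumptions for illustration; the slides do not say what values or interfaces Cha-Cha used.

```python
from collections import deque

def crawl_with_retries(urls, fetch, max_retries=3):
    """Fetch each URL, re-queueing failures at the back until retries run out.

    `fetch` is a caller-supplied function returning True on success and
    False when the server is down (a hypothetical interface).
    """
    queue = deque((url, max_retries) for url in urls)  # tag each URL with a counter
    fetched, given_up = [], []
    while queue:
        url, retries = queue.popleft()
        if fetch(url):
            fetched.append(url)
        elif retries > 0:
            queue.append((url, retries - 1))  # back of the queue, one fewer retry
        else:
            given_up.append(url)              # counter is 0: give up on the URL
    return fetched, given_up

# Usage with a fake fetcher: "flaky" fails twice then succeeds, "dead" always fails.
attempts = {}

def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "dead":
        return False
    if url == "flaky":
        return attempts[url] > 2
    return True

fetched, given_up = crawl_with_retries(["ok", "flaky", "dead"], fake_fetch)
```

Putting a failed URL at the end of the queue gives a temporarily unavailable server time to recover before it is tried again.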

Page 17: Custom Web Crawler

Special considerations
– servers with multiple names
• info.berkeley.edu == www.sims.berkeley.edu
– solution:
• hash the home page of the server into a table
• whenever a new server is found, compare its homepage to those in the table
• if a duplicate, record the new server’s name as being the same as the original server’s
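
One way to picture the alias check is the Python sketch below. It is an illustration under assumptions: SHA-256 is an arbitrary choice of hash, `find_aliases` is a hypothetical name, and the page contents are made up. Which name counts as the "original" simply depends on which server the crawl reaches first.

```python
import hashlib

def find_aliases(home_pages):
    """Map each server name to the first server seen with the same home page.

    `home_pages` is an iterable of (server_name, home_page_bytes) pairs in
    crawl order. SHA-256 is an illustrative choice of hash.
    """
    first_seen = {}   # digest of home page -> canonical server name
    canonical = {}    # server name -> canonical server name
    for server, content in home_pages:
        digest = hashlib.sha256(content).hexdigest()
        canonical[server] = first_seen.setdefault(digest, server)
    return canonical

# Made-up home page contents for three servers.
aliases = find_aliases([
    ("www.sims.berkeley.edu", b"<html>Welcome to SIMS</html>"),
    ("www.eecs.berkeley.edu", b"<html>EECS</html>"),
    ("info.berkeley.edu", b"<html>Welcome to SIMS</html>"),
])
```

Here `aliases["info.berkeley.edu"]` resolves to `www.sims.berkeley.edu`, so pages found under either name can be treated as one server.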

Page 18: Cha-Cha Metadata

Information about web pages
– Title
– Length
– Inlinks
– Outlinks
– Shortest paths from a root home page

Page 19: Metafile Generator

Main task: find shortest path information
– Two passes: global and local

Global pass:
– start with main home page H (www.berkeley.edu)
– find shortest path from H to every page in the system
– when this is done, write out a metafile for each page

Page 20: Metafile Generator

Local pass:
– start with a list of all the servers found during the crawl
– for each server S
• find shortest path from S to every page in the system
• do this the same way as in the global pass but store the results in a different database
• when done, write out a metafile for each page, in a different directory than for the global pass
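
Both passes come down to a shortest-path search from a root page. A minimal Python sketch, assuming an unweighted link graph held in memory: breadth-first search reaches each page first along a path with the fewest hyperlinks. The graph below is hypothetical (the intermediate SIMS pages are invented), and the sketch keeps only one shortest path per page, whereas Cha-Cha stores several.

```python
from collections import deque

def shortest_paths(links, root):
    """Return {page: [root, ..., page]} for every page reachable from root."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for out in links.get(page, []):
            if out not in parent:        # first visit is along a shortest path
                parent[out] = page
                queue.append(out)
    paths = {}
    for page in parent:
        path, node = [], page
        while node is not None:          # walk parent pointers back to the root
            path.append(node)
            node = parent[node]
        paths[page] = path[::-1]
    return paths

# Hypothetical link graph (page -> outlinks).
links = {
    "www.berkeley.edu": ["search.berkeley.edu", "www.sims.berkeley.edu"],
    "search.berkeley.edu": ["cha-cha.berkeley.edu"],
    "cha-cha.berkeley.edu": ["www.sims.berkeley.edu/~hearst"],
    "www.sims.berkeley.edu": ["www.sims.berkeley.edu/people"],
    "www.sims.berkeley.edu/people": ["www.sims.berkeley.edu/faculty"],
    "www.sims.berkeley.edu/faculty": ["www.sims.berkeley.edu/~hearst"],
}
global_paths = shortest_paths(links, "www.berkeley.edu")       # global pass
local_paths = shortest_paths(links, "www.sims.berkeley.edu")   # local pass for one server
```

In this toy graph the global path to `www.sims.berkeley.edu/~hearst` runs through the search pages, while the local path stays inside the SIMS server.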

Page 21: Metafile Generator (cont.)

Combine local and global path information

Purpose:
– locality should “trump” global paths, but not all local pages are reachable locally
– example:
• the shortest path from www.berkeley.edu to www.sims.berkeley.edu/~hearst is:
www.berkeley.edu -> search.berkeley.edu -> cha-cha.berkeley.edu -> www.sims.berkeley.edu/~hearst
• but we want my home page to be under the SIMS faculty listing
• solution: let local trump global

Page 22: Metafile Generator (cont.)

Combine local and global path information

How to do it:
– go through the metafiles in the global directory
– for each metafile
• if there already is a metafile for that URL in the local directory, skip this metafile
• otherwise (there is no metafile for this URL locally) copy the metafile into the local directory

Why not just use local metafiles?
– some pages are not linked to within their own domain
• e.g., student association hosted within a particular student’s domain
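
The merge step can be sketched with plain file operations. The Python below is an illustration only: the directory layout, the `.meta` file names, and the function name `merge_metafiles` are assumptions, and it runs on throwaway temporary directories.

```python
import shutil
import tempfile
from pathlib import Path

def merge_metafiles(global_dir, local_dir):
    """Copy each global metafile into local_dir unless a local one already exists."""
    copied = []
    for meta in sorted(Path(global_dir).iterdir()):
        target = Path(local_dir) / meta.name
        if target.exists():
            continue                  # local path information trumps global
        shutil.copy(meta, target)     # page not reachable locally: fall back to global
        copied.append(meta.name)
    return copied

# Usage on throwaway directories; the file names are hypothetical.
root = Path(tempfile.mkdtemp())
global_dir, local_dir = root / "global", root / "local"
global_dir.mkdir()
local_dir.mkdir()
(global_dir / "hearst.meta").write_text("global path")
(global_dir / "student-assoc.meta").write_text("global path")
(local_dir / "hearst.meta").write_text("local path")

copied = merge_metafiles(global_dir, local_dir)
merged = {p.name: p.read_text() for p in sorted(local_dir.iterdir())}
shutil.rmtree(root)  # clean up the throwaway directories
```

After the merge, the local directory holds the local metafile where one existed and the global one otherwise, which is the "local trumps global" rule.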

Page 23: Sample Cha-Cha Metadata file

<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Date>null</Date>
<Size>4865</Size>

<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Inlinks>http://www-resources.berkeley.edu/nhpteaching/</Inlinks>

<!-- OUTLINKS -->
<OutlinkCount>21</OutlinkCount>
<Outlinks>
http://www.sims.berkeley.edu/about.html
http://www.sims.berkeley.edu/search.html
http://www.sims.berkeley.edu/events/conferences/
http://www.sims.berkeley.edu/resources/sites.html
http://www.sims.berkeley.edu/people/masters.html

Page 24: CHESHIRE II

Search back-end for Cha-Cha
– Ray Larson et al. ASIS 95, JASIS 96

CHESHIRE II system:
• Full Service Full Text Search
• Client/Server architecture
• Z39.50 IR protocol
• Interprets documents written in SGML
• Probabilistic Ranking
• Flexible data representation

Page 25: CHESHIRE II (cont.)

A big advantage of Cheshire:
– don’t have to write a special parser for special document types
– instead, simply create one DTD and the system takes care of parsing the metafiles for us

A related advantage:
– can create indexes on individual components of the document
• allows efficient title search, home page search, domain-based search, without extra programming

Page 26: Cha-Cha Document Type Definition

<!SGML "ISO 8879:1986"
----
CHARSET
BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0  9 UNUSED
          9  2  9
         11  2 UNUSED
         13  1 13
         14 18 UNUSED
         32 95 32
        127  1 UNUSED
BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET 128 32 UNUSED
        160 95 32
        255  1 UNUSED

Page 27: Cha-Cha DTD, cont. (parts omitted)

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE, INLINKCOUNT, INLINKS, OUTLINKCOUNT, OUTLINKS, DEPTH?, SHORTESTPATHSCOUNT?, SHORTESTPATHS?, MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?, FILE?)>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT URL - o (#PCDATA)>
<!ELEMENT DATE - o (#PCDATA)>
<!ELEMENT TITLE - o (#PCDATA)>
<!ELEMENT SIZE - o (#PCDATA)>
<!ELEMENT INLINKCOUNT - o (#PCDATA)>
<!ELEMENT INLINKS - o (#PCDATA)>
<!ELEMENT OUTLINKCOUNT - o (#PCDATA)>
<!ELEMENT OUTLINKS - o (#PCDATA)>
<!ELEMENT DEPTH - o (#PCDATA)>
<!ELEMENT SHORTESTPATHSCOUNT - o (#PCDATA)>
<!ELEMENT SHORTESTPATHS - o (#PCDATA)>

Page 28: Cha-Cha Online Processing

Page 29: Building the Outline View

Main issue: how to combine shortest paths
– There are approximately three shortest paths per web page
– We assume users do not want to see the page multiple times

Strategy:
– Group hits together within the hierarchy
– Try to avoid showing subhierarchies with singleton hits
• This assumption is based in part on evidence from our earlier clustering research that relevant documents tend to cluster near one another

Page 30: Building the Outline View (cont.)

Goals of the algorithm:
– (I) Group (recursively) as many pages together within a subhierarchy as possible
• Avoid (recursively) branches that terminate in only one hit (leaf)
– (II) Remove as many internal nodes as possible while still retaining at least one valid path to every leaf
– (III) Remove as many edges as possible while retaining at least one path to every leaf

Page 31: Building the Outline View (cont.)

To achieve these goals we need a non-standard graph algorithm
– To do it properly, every possible subset of nodes at depth D should be considered to determine the minimal subset which covers all nodes at depth D+1
– This is inefficient: it would require 2^k checks for k nodes at depth D

Instead, we use a heuristic approach which approximates the optimal results
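
To make the cost of the exact approach concrete, here is a brute-force Python sketch of the minimal-cover search. It is not part of Cha-Cha; `minimal_cover` and the example data are hypothetical, and the function counts how many subsets it examines to show where the 2^k bound comes from.

```python
from itertools import combinations

def minimal_cover(parents):
    """Smallest set of parents whose children include every child.

    `parents` maps each node at depth D to the set of nodes it links to
    at depth D+1. Subsets are tried in order of size, so up to 2^k
    subsets are examined for k parents.
    """
    all_children = set().union(*parents.values())
    checked = 0
    for size in range(len(parents) + 1):
        for subset in combinations(sorted(parents), size):
            checked += 1
            covered = set().union(*(parents[p] for p in subset))
            if covered == all_children:
                return set(subset), checked
    return set(parents), checked

# Hypothetical level: three parents covering three children.
cover, checked = minimal_cover({"A": {"p", "q"}, "B": {"q"}, "C": {"r"}})
```

With only three parents the search is cheap, but the number of candidate subsets doubles with every extra parent, which is why a heuristic is used instead.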

Page 32: Building the Outline View (cont.)

First, a top-down pass
– record depth of each node and the number of children it links to directly

Second, a bottom-up pass
– identify the deepest nodes (the leaves)
– D <- the set of nodes that are parents of leaves
– Sort the nodes in D in ascending order of how many active children they link to at depth D+1
– A node is active if it has not been eliminated

Page 33: Building the Outline View (cont.)

Bottom-up pass, continued
– every node is a candidate to be eliminated
– those nodes with the fewest children are eliminated first
• because of goal (I)
– for each candidate C, if C links to one or more active nodes at depth D+1 that are not covered by any other active node, then C cannot be eliminated. Otherwise, C is removed from the active list

After a level D is complete, there are no active nodes at depth D that cover exclusively nodes that are also covered by another node at depth D
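
The elimination step for a single level can be sketched as follows. This is a reading of the slides rather than the actual Cha-Cha code: `prune_level` is a hypothetical name, ties between candidates with the same number of children are broken alphabetically (an assumption), and the example nodes are invented.

```python
def prune_level(parents):
    """Greedy pass over one level: drop parents whose children are all covered elsewhere.

    `parents` maps each node at depth D to the set of active nodes it
    links to at depth D+1. Returns the set of parents that stay active.
    """
    active = set(parents)
    # Candidates with the fewest children are considered for elimination first.
    for cand in sorted(parents, key=lambda p: (len(parents[p]), p)):
        others = active - {cand}
        covered_elsewhere = set().union(*(parents[p] for p in others))
        if parents[cand] <= covered_elsewhere:
            active.discard(cand)      # every child has another active parent
    return active

# Hypothetical level: "search" only reaches a hit that "faculty" also reaches.
remaining = prune_level({
    "faculty": {"hearst", "larson"},
    "search": {"hearst"},
    "courses": {"is202"},
})
```

Trying small parents first tends to leave the larger groupings in place, which matches goal (I) of keeping hits together within a subhierarchy.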

Page 34: Building the Outline View (cont.)

Retaining rank ordering
– Build up the tree by first placing in the tree the hit (leaf) that is highest ranked
– As more leaves are added, more parts of the hierarchy are added, but the order in which the parts of the hierarchy are added is retained

When the hierarchy has been built, it is traversed to create the HTML listing

Entire procedure is very fast for on the order of 100 hits
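
A small Python sketch of rank-preserving tree construction and the traversal that emits HTML. It assumes each hit arrives as a path of labels from the root, best-ranked first; the function names, the nested-list markup, and the sample hits are illustrative choices, not the markup Cha-Cha produced.

```python
from html import escape

def build_outline(ranked_paths):
    """Insert hits in rank order into a nested dict.

    Python dicts preserve insertion order, so each part of the hierarchy
    keeps the position at which it was first added.
    """
    tree = {}
    for path in ranked_paths:
        node = tree
        for label in path:
            node = node.setdefault(label, {})
    return tree

def render(tree):
    """Depth-first traversal producing nested HTML lists."""
    if not tree:
        return ""
    items = "".join(f"<li>{escape(label)}{render(sub)}</li>" for label, sub in tree.items())
    return f"<ul>{items}</ul>"

# Hypothetical hits, best-ranked first; each is a path from the root to the hit.
hits = [
    ["SIMS", "Faculty", "Marti Hearst"],
    ["EECS", "Courses"],
    ["SIMS", "Faculty", "Ray Larson"],
]
outline = build_outline(hits)
html_listing = render(outline)
```

The third hit joins the existing SIMS branch instead of starting a new one, so SIMS still appears before EECS because its best hit was ranked first.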

Page 35: Outline

Main Idea, Motivation
System Implementation
User Interface Assessment
Related and Future Work

Page 36: Pilot Study

May, 1998
4 interfaces
7 subjects
10K pages, Inktomi as backend

[Figure panels: Frames, List, Categorized, No Frames]

Page 37: Pilot Study

Results
– global preference: 1 - good, 4 - bad

Interface      Ave.
Frames         2.1
Categorized    2.1
List           2.4
No Frames      3.3

Problems
– small data set
• “clean” results
• depth < 5
– global paths only

Page 38: Follow-up Study

September, 1998
2 interfaces: outline and list view
18 subjects
50K pages

[Figure panels: Outline, List]

Page 39: Follow-up Study

Results
– 30% faster with outline view (10% conf. level)
– slight preference for list view!?

Problems
– timing
• users “trapped” by external sites
– deep hits
• global paths only with no aggressive path trimming
– arbitrarily chosen questions
– page abstract
• didn’t use “plain” abstract for the list view

Page 40: Web Survey

UCB’s homepage
– November/December 1998
– 162 participants

Results
– outline view useful?
• 56 yes, 31 no, 75 left blank
– which do you prefer?
• 38 Cha-Cha, 21 Snap, 61 No Opinion, 42 left blank

Problems
– self-selected population
– small sample

Page 41: Overall Assessment

Survey, studies, and anecdotal information suggest
– A significant proportion of users find the outline view helpful and like it quite a bit
• Often these people note it isn’t helpful for every query but often helps with difficult ones
– A significant proportion of users don’t particularly like it and prefer a standard list view

Claim
– An interface that helps/is liked by a significant proportion of users is a useful contribution.
– Interface design needs to take individual differences into account

Page 42: Open UI Issues (we have preliminary results)

Repeating domains
– The same domain could show up on multiple pages.

Where am I?
– How to present the current domain(s)?
– How to specify sub-domain/super-domain searches?

How to compact hierarchy?
– e.g. mailing lists

Page 43: Open UI Issues (we have preliminary results)

Is the two-tier/frames view better?
– Implemented outline view to keep the server simple.
– Showing only 3 levels may not be enough.
– How to align the abstracts?
– Do we repeat the hierarchy?

How to scale down to fit small displays?

Suggestions to improve the outline view.

Integrating UI with other sites.
– What to customize?

Page 44: Outline

Main Idea, Motivation
System Implementation
User Interface Assessment
Related and Future Work

Page 45: Related Work

Superbook (Remde et al. 87, Egan et al. 89)
WebTOC (Nation 97)
AMIT (Wittenburg 97)
WebCutter (Maarek 97)

Page 46: Summary

Better user interfaces for search should:
– Help users understand starting points/sources
– Place results of search into an organizing context

One (of many) approaches
– Cha-Cha: simultaneously browse and search intranet site context

Page 47: Future Work

Small Interface Improvements
– Searching within subdomains
– Indicating repeating subdomains

Ranking
– Special handling for short queries / web-specific ranking / phrase queries
– Spelling correction suggestions

Major Change:
– Integrating with topic-centric metadata