Lecture 23: Interfaces for Information Retrieval



Page 1: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 1 - IS 202 – FALL 2002

Lecture 23: Interfaces for Information Retrieval

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002

http://www.sims.berkeley.edu/academics/courses/is202/f02/

SIMS 202: Information Organization and Retrieval

Page 2: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 2 - IS 202 – FALL 2002

Lecture Overview

• Review and Continuation
  – Web Search Engines and Algorithms
• Interfaces for Information Retrieval
  – Introduction to HCI
  – Why Interfaces Don’t Work
  – Early Visions: Memex

Credit for some of the slides in this lecture goes to Marti Hearst

Page 4: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 4 - IS 202 – FALL 2002

Search Engines

• Crawling

• Indexing

• Querying

Page 5: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 5 - IS 202 – FALL 2002

Web Search Engine Layers

From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Page 6: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 6 - IS 202 – FALL 2002

Standard Web Search Engine Architecture

[Diagram labels: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; show results to user; DocIds]

Page 7: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 7 - IS 202 – FALL 2002

More detailed architecture (Brin & Page 98)

Only covers the preprocessing in detail, not the query serving

Google Web Search Architecture

Page 8: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 8 - IS 202 – FALL 2002

Indexes for Web Search Engines

• Inverted indexes are still used, even though the web is so huge
• Some systems partition the indexes across different machines
  – Each machine handles different parts of the data
• Other systems duplicate the data across many machines
  – Queries are distributed among the machines
• Most do a combination of these
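
As a minimal sketch of the partitioning idea, assuming a toy in-memory collection: each partition builds its own inverted index over its share of the documents, and a query is sent to every partition and the results merged. The function names, the modulo partitioning scheme, and the sample documents are all hypothetical and are not taken from any real engine.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition_docs(docs, num_partitions):
    """Assign each document to a partition by doc id (assumed scheme)."""
    parts = [dict() for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        parts[doc_id % num_partitions][doc_id] = text
    return parts

def search(partition_indexes, term):
    """Ask every partition for the term and merge the doc ids."""
    hits = []
    for index in partition_indexes:
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)

# Illustrative documents only.
docs = {
    0: "web search engines",
    1: "inverted index for search",
    2: "crawl the web",
    3: "index the web",
}
partition_indexes = [build_inverted_index(p) for p in partition_docs(docs, 2)]
print(search(partition_indexes, "web"))
```

Duplicating the data instead would mean keeping several copies of each partition's index and sending each query to only one copy.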

Page 9: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 9 - IS 202 – FALL 2002

Search Engine Querying

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second

Each column can handle 7M pages

To handle more queries, add another row.

From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
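
The row and column figures above suggest a simple sizing calculation. The sketch below assumes capacity scales linearly (each extra row adds 120 queries per second, each extra column adds 7M pages), which is a simplification. The function name and the sample workload are hypothetical.

```python
QUERIES_PER_ROW = 120          # from the slide: queries per second per row
PAGES_PER_COLUMN = 7_000_000   # from the slide: pages per column

def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed, assuming linear scaling."""
    columns = -(-total_pages // PAGES_PER_COLUMN)  # ceiling division
    rows = -(-peak_qps // QUERIES_PER_ROW)
    return rows, columns, rows * columns

# Hypothetical workload: 70M pages at 600 queries per second.
print(grid_size(70_000_000, 600))
```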

Page 10: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 10 - IS 202 – FALL 2002

Querying: Cascading Allocation of CPUs

• A variation on this that produces cost savings:
  – Put high-quality/common pages on many machines
  – Put lower-quality/less common pages on fewer machines
  – Query goes to high-quality machines first
  – If no hits found there, go to other machines
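
The cascade can be sketched as a lookup over an ordered list of tiers, stopping at the first tier that returns hits. This is an illustration only: the tier contents and the function name are hypothetical, and each tier is modeled as a plain dictionary rather than a cluster of machines.

```python
def cascaded_search(tiers, term):
    """Try each tier in order; return (tier number, hits) for the first tier with hits."""
    for tier_number, index in enumerate(tiers):
        hits = index.get(term, [])
        if hits:
            return tier_number, hits
    return None, []

# Hypothetical tiers: term -> doc ids.
high_quality_tier = {"toyota": [1, 2]}
low_quality_tier = {"toyota": [7], "obscure": [9]}
tiers = [high_quality_tier, low_quality_tier]

print(cascaded_search(tiers, "toyota"))
print(cascaded_search(tiers, "obscure"))
```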

Page 11: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 11 - IS 202 – FALL 2002

Google

• Google maintains the world's largest Linux cluster (10,000 servers)
• These are partitioned between index servers and page servers
  – Index servers resolve the queries (massively parallel processing)
  – Page servers deliver the results of the queries
• Over 3 billion web pages are indexed and served by Google

Page 12: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 12 - IS 202 – FALL 2002

Search Engine Indexes

• Starting Points for Users include
• Manually compiled lists
  – Directories
• Page “popularity”
  – Frequently visited pages (in general)
  – Frequently visited pages as a result of a query
• Link “co-citation”
  – Which sites are linked to by other sites?

Page 13: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 13 - IS 202 – FALL 2002

Starting Points: What is Really Being Used?

• Today's search engines combine these methods in various ways
  – Integration of directories
      • Today most web search engines integrate categories into the results listings
      • Lycos, MSN, Google
  – Link analysis
      • Google uses it; others are also using it
      • Words on the links seem to be especially useful
  – Page popularity
      • Many use DirectHit’s popularity rankings

Page 14: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 14 - IS 202 – FALL 2002

Web Page Ranking

• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combining subsets of:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information
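
One simple way to combine several signals is a weighted sum. Since real ranking functions are proprietary, the sketch below is purely illustrative: the signal names, the weights, and the sample pages are all invented, and a real engine may combine evidence quite differently.

```python
# Hypothetical weights; real engines do not publish theirs.
WEIGHTS = {
    "term_frequency": 1.0,
    "in_title": 2.0,
    "link_score": 1.5,
    "popularity": 0.5,
}

def combined_score(signals, weights=WEIGHTS):
    """Weighted sum of whichever signals are present for a page."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

# Invented signal values for two pages.
pages = {
    "a": {"term_frequency": 3.0, "link_score": 1.0},
    "b": {"term_frequency": 1.0, "in_title": 1.0, "link_score": 2.0, "popularity": 1.0},
}
ranked = sorted(pages, key=lambda p: combined_score(pages[p]), reverse=True)
print(ranked)
```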

Page 15: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 15 - IS 202 – FALL 2002

Ranking: Hearst ‘96

• Proximity search can help get high-precision results if the query has more than one term
  – Combine Boolean and passage-level proximity
  – Provides significant improvements when retrieving top 5, 10, 20, 30 documents
  – Results reproduced by Mitra et al. 98
  – Google uses something similar
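
To illustrate the general idea of combining a Boolean constraint with proximity, the sketch below keeps only documents that contain every query term (Boolean AND) and ranks them by the smallest token window covering all terms. This is a simplified stand-in and not the method evaluated in the cited work. The function names and sample documents are hypothetical.

```python
def min_span(tokens, terms):
    """Size of the smallest token window containing every term, or None."""
    terms = set(terms)
    last_seen = {}
    best = None
    for pos, tok in enumerate(tokens):
        if tok in terms:
            last_seen[tok] = pos
            if len(last_seen) == len(terms):
                # Smallest window ending here starts at the oldest last-seen term.
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

def proximity_rank(docs, query):
    """Boolean AND filter, then order by tightest window first."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        span = min_span(text.lower().split(), terms)
        if span is not None:
            scored.append((span, doc_id))
    return [doc_id for span, doc_id in sorted(scored)]

# Invented documents.
docs = {
    1: "information retrieval interfaces",
    2: "information about interfaces for document retrieval",
    3: "information visualization",
}
print(proximity_rank(docs, "information retrieval"))
```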

Page 16: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 16 - IS 202 – FALL 2002

Ranking: Link Analysis

• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page
  – The words on the links pointing to this page are useful indicators of what this page is about
  – References: Page et al. 98, Kleinberg 98

Page 17: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 17 - IS 202 – FALL 2002

Ranking: Link Analysis

• Why does this work?
  – The official Toyota site will be linked to by lots of other official (or high-quality) sites
  – The best Toyota fan-club site probably also has many links pointing to it
  – Lower-quality sites do not have as many high-quality sites linking to them

Page 18: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 18 - IS 202 – FALL 2002

Ranking: PageRank

• Google uses PageRank
• We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one
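
The formula can be iterated directly. The sketch below is a minimal illustration on a hypothetical three-page graph in which every page has at least one outgoing link; it is not how a production system computes PageRank. One caveat worth noting: with the formula exactly as written, the values sum to the number of pages rather than to one; dividing each value by the number of pages (or using (1-d)/N as the constant term) gives a distribution that sums to one.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
normalized = {page: value / len(ranks) for page, value in ranks.items()}
print(ranks)
print(normalized)
```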

Page 19: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 19 - IS 202 – FALL 2002

PageRank

• Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.)
• Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.)
• Computation is an iterative algorithm and converges to the principal eigenvector of the link matrix

Page 20: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 20 - IS 202 – FALL 2002

Lecture Overview

• Review and Continuation
  – Web Search Engines and Algorithms
• Interfaces for Information Retrieval
  – Introduction to HCI
  – Why Interfaces Don’t Work
  – Early Visions: Memex

Credit for some of the slides in this lecture goes to Marti Hearst

Page 21: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 21 - IS 202 – FALL 2002

“Drawing the Circles”

Page 30: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 30

Human-Computer Interaction (HCI)

• Human
  – The end-users of a program
  – The others in the organization
• Computer
  – The machines the programs run on
• Interaction
  – The users tell the computers what they want
  – The computers communicate results

Page 31: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 31

What is HCI?

[Diagram labels: Humans; Technology; Task; Design; Organizational & Social Issues]

Page 32: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 32 - IS 202 – FALL 2002

Shneiderman on HCI

• Well-designed interactive computer systems
  – Promote
      • Positive feelings of success
      • Competence
      • Mastery
  – Allow users to concentrate on their work, exploration, or pleasure, rather than on the system or the interface

Page 33: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 33

Design Guidelines

• Set of design rules to follow

• Apply at multiple levels of design

• Are neither complete nor orthogonal

• Have psychological underpinnings (ideally)

Page 34: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 34 - IS 202 – FALL 2002

Shneiderman’s Design Principles

• Provide informative feedback

• Permit easy reversal of actions

• Support an internal locus of control

• Reduce working memory load

• Provide alternative interfaces for expert and novice users

Page 35: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 35 - IS 202 – FALL 2002

Provide Informative Feedback

• About:
  – The relationship between query specification and documents retrieved
  – Relationships among retrieved documents
  – Relationships between retrieved documents and metadata describing collections

Page 36: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 36 - IS 202 – FALL 2002

Reduce Working Memory Load

• Provide mechanisms for keeping track of choices made during the search process
• Allow users to:
  – Return to temporarily abandoned strategies
  – Jump from one strategy to the next
  – Retain information and context across search sessions
• Provide browsable information that is relevant to the current stage of the search process
  – Related terms or metadata
  – Search starting points (e.g., lists of sources, topic lists)

Page 37: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 37 - IS 202 – FALL 2002

Interfaces For Expert And Novice Users

• Simplicity vs. power tradeoffs
• “Scaffolded” user interface
• How much information to show the user?
  – Number and complexity of user operations
  – Variants of operations
  – Inner workings of system itself
  – System history
• Example:
  – Television remote control

Page 38: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 38 - IS 202 – FALL 2002

User Differences

• Abilities, preferences, predilections
  – Spatial ability
  – Memory
  – Reasoning abilities
  – Verbal aptitudes
  – Personality differences
  – Age, gender, ethnicity, class, sexuality, culture, education
  – Modality preferences/restrictions
      • Vision, audition, speech, gesture, haptics, locomotion

Page 39: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 39

Nielsen’s Usability Slogans

• Your best guess is not good enough

• The user is always right

• The user is not always right

• Users are not designers

• Designers are not users

• Less is more

• Details matter

(from Nielsen’s “Usability Engineering”)

Page 40: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 40

Who Builds UIs?

• A team of specialists (ideally)
  – Graphic designers
  – Interaction / interface designers
  – Technical writers
  – Marketers
  – Test engineers
  – Software engineers

Page 41: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 41

How to Design and Build UIs

• Task analysis

• Rapid prototyping

• Evaluation

• Implementation

[Diagram labels: Design; Prototype; Evaluate]

Iterate at every stage!

Page 42: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 42

Task Analysis

• Observe existing work practices

• Create examples and scenarios of actual use

• Try out new ideas before building software

Page 43: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 43

Rapid Prototyping

• Build a mock-up of design

• Low-fidelity techniques
  – Paper sketches
  – Cut, copy, paste
  – Video segments
• Interactive prototyping tools
  – Visual Basic, HyperCard, Director, etc.
• UI builders
  – NeXT, etc.

Page 44: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 44 - IS 202 – FALL 2002

Evaluation Techniques

• Qualitative vs. quantitative methods
• Qualitative (non-numeric, discursive, ethnographic)
  – Focus groups
  – Interviews
  – Surveys
  – User observation
  – Participatory design sessions
• Quantitative (numeric, statistical, empirical)
  – User testing
  – System testing

Page 45: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 45 - IS 202 – FALL 2002

Qualitative Questions

• User experience

• User preferences

• User recommendations

• “Design dialogue”

Page 46: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 46 - IS 202 – FALL 2002

Quantitative Questions

• Precision

• Recall

• Time required to learn the system

• Time required to achieve goals on benchmark tasks

• Error rates

• Retention of the use of the interface over time
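
Of these measures, precision and recall have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch follows; the function name and the sample document ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
print(precision, recall)
```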

Page 47: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 47 - IS 202 – FALL 2002

Information Visualization

• Utility
  – Inherently visual data
  – Making the abstract concrete
  – Making the invisible visible
• Techniques
  – Icons
  – Color highlighting
  – Brushing and linking
  – Panning and zooming
  – Focus-plus-context
  – Magic lenses
  – Animation

Page 48: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 48 - IS 202 – FALL 2002

Lecture Overview

• Review and Continuation
  – Web Search Engines and Algorithms
• Interfaces for Information Retrieval
  – Introduction to HCI
  – Why Interfaces Don’t Work
  – Early Visions: Memex

Credit for some of the slides in this lecture goes to Marti Hearst

Page 49: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 49 - IS 202 – FALL 2002

Why Interfaces Don’t Work

• Because…
  – We still think of using the interface
  – We still talk of designing the interface
  – We still talk of improving the interface
• “We need to aid the task, not the interface to the task.”
• “The computer of the future should be invisible.”

Page 50: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 50 - IS 202 – FALL 2002

Norman on Design Priorities

1. The user: what does the person really need to have accomplished?

2. The task: analyze the task. How best can the job be done, taking into account the whole setting in which it is embedded, including the other tasks to be accomplished, the social setting, the people, and the organization?

3. As much as possible, make the task dominate; make the tools invisible.

4. Then, get the interaction right, making the right things visible, exploiting affordances and constraints, providing the proper mental models, and so on: the rules of good design for the user, written about many, many times in many, many places.

Page 51: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 51 - IS 202 – FALL 2002

Lecture Overview

• Review and Continuation
  – Web Search Engines and Algorithms
• Interfaces for Information Retrieval
  – Introduction to HCI
  – Why Interfaces Don’t Work
  – Early Visions: Memex

Credit for some of the slides in this lecture goes to Marti Hearst

Page 52: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 52 - IS 202 – FALL 2002

“What Dr. Bush Foresees”

Cyclops Camera
Worn on forehead, it would photograph anything you see and want to record. Film would be developed at once by dry photography.

Microfilm
It could reduce Encyclopaedia Britannica to volume of a matchbox. Material cost: 5¢. Thus a whole library could be kept in a desk.

Vocoder
A machine which could type when talked to. But you might have to talk a special phonetic language to this mechanical supersecretary.

Thinking machine
A development of the mathematical calculator. Give it premises and it would pass out conclusions, all in accordance with logic.

Memex
An aid to memory. Like the brain, Memex would file material by association. Press a key and it would run through a “trail” of facts.

Page 53: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 53 - IS 202 – FALL 2002

Memex

Page 54: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 54 - IS 202 – FALL 2002

Memex Detail

Page 55: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 55 - IS 202 – FALL 2002

Cyclops Camera

Page 56: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 56 - IS 202 – FALL 2002

Vocoder: “Supersecretary”

Page 57: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 57 - IS 202 – FALL 2002

Investigator at Work

• “One can now picture a future investigator in his laboratory. His hands are free, and he is not anchored. As he moves about and observes, he photographs and comments. Time is automatically recorded to tie the two records together. If he goes into the field, he may be connected by radio to his recorder. As he ponders over his notes in the evening, he again talks his comments into the record. His typed record, as well as his photographs, may be both in miniature, so that he projects them for examination.”

Page 58: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 58 - IS 202 – FALL 2002

Memex

• “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

Page 59: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 59 - IS 202 – FALL 2002

Associative Indexing

• “[…] associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of memex. The process of tying two items together is the important thing.”

Page 60: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 60 - IS 202 – FALL 2002

The WWW circa 1945

• “It is exactly as though the physical items had been gathered together from widely separated sources and bound together to form a new book. But it is more than this; for any item can be joined into numerous trails, the trails can bifurcate, and they can give birth to side trails.”

• “Wholly new forms of encyclopaedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.”

Page 61: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 61 - IS 202 – FALL 2002

Selection

• “The heart of the problem, and of the personal machine we have here considered, is the task of selection. And here, in spite of great progress, we are still lame. Selection, in the broad sense, is still a stone adze in the hands of a cabinetmaker.”

—“Memex Revisited” (Bush 1965)

Page 62: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 62 - IS 202 – FALL 2002

Interaction Paradigms for IR

• Direct manipulation
  – Query specification
  – Query refinement
  – Result selection
• Delegation
  – Agents
  – Recommender systems
  – Filtering

Page 63: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 63 - IS 202 – FALL 2002

The “Adaptive” Memex

• “In an adaptive Memex, the owner has delegated to the machine the ability to propose or effect changes in the stored information. By analogy to business practice, the Memex is said to be functioning as an agent (Kay, 1984). The machine is playing an autonomous role within a restricted charter: to attempt a more effective organization of the information based on observations of actual use and topical similarities.”

Page 64: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 64 - IS 202 – FALL 2002

Next Time: HCI for IR

• Browsing
  – Visualizing collections and documents
  – Navigating collections and documents
• Searching
  – Formulating queries
  – Visualizing results
  – Navigating results
  – Refining queries
  – Selecting results

Page 65: Lecture 23: Interfaces for Information Retrieval

2002.11.19 - SLIDE 65 - IS 202 – FALL 2002

Next Time: HCI for IR

• Interfaces for Information Retrieval
• Readings
  – MIR 10.4 – 10.10
