2004.09.07 - Slide 1 - IS 202 – Fall 2004
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/
SIMS 202:
Information Organization
and Retrieval
Lecture 3: Intro to Information Retrieval
Slide 2
Lecture Overview
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
• Discussion
Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey
Slide 3
Slide 4
Review: Information Overload
• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)
• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
Slide 5
Key Issues In This Course
• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them – Organizing
• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs – Retrieving
Slide 6
Key Issues
[Information lifecycle diagram: Creation (authoring, modifying) → Organizing/Indexing → Storing/Retrieval → Distribution/Networking → Accessing/Filtering → Utilization (using, searching); records pass from Active to Semi-Active to Inactive, leading to Retention/Mining or Disposition/Discard]
Slide 7
IR Topics for 202
• The Search Process
• Information Retrieval Models
– Boolean, Vector, and Probabilistic
• Web-Specific Issues
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
– Precision/Recall
– Relevance
– User Studies
• User Interface Issues
• Special Kinds of Search
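The precision and recall measures listed above can be sketched in a few lines of Python (the document IDs below are hypothetical):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant in the collection
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
# p = 2/4 = 0.5, r = 2/3
```

Note the tension the later slides return to: enlarging the result set tends to raise recall while lowering precision.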
Slide 8
Lecture Overview
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
• Discussion
Slide 9
Web Search Questions
• What do people search for?
• How do people use search engines?
– How often do people find what they are looking for?
– How difficult is it for people to find what they are looking for?
• How can search engines be improved?
Slide 10
What Do People Search for on the Web?
• Study by Spink et al., Oct 98
– www.shef.ac.uk/~is/publications/infres/paper53.html
– Survey on Excite, 13 questions
– Data for 316 surveys
• (If you are interested in this, Amanda Spink has a new book entitled “Web Search: Public Searching On the Web”)
Slide 11
What Do People Search for on the Web?
• Topics:
– Genealogy/Public Figure: 12%
– Computer related: 12%
– Business: 12%
– Entertainment: 8%
– Medical: 8%
– Politics & Government: 7%
– News: 7%
– Hobbies: 6%
– General info/surfing: 6%
– Science: 6%
– Travel: 5%
– Arts/education/shopping/images: 14%
• Something is missing…
Slide 12
What Do People Search for on the Web?
Most frequent terms in 50,000 queries from Excite, 1997:
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
Slide 13
Why Do These Differ?
• Self-reporting survey
• The nature of language
– Only a few ways to say certain things
– Many different ways to express most concepts
• UFO, flying saucer, space ship, satellite
• How many ways are there to talk about history?
Slide 14
What is on the Web?
List of 31,928,892 terms from analysis of 49,602,191 web pages. Most frequent terms (with frequencies):
• 65002930 the • 62789720 a • 60857930 to • 57248022 of • 54078359 and • 52928506 in • 50686940 s • 49986064 for • 45999001 on • 42205245 this • 41203451 is • 39779377 by • 35439894 with • 35284151 or • 34446866 at • 33528897 all • 31583607 are • 30998255 from • 30755410 e • 30080013 you • 29669506 be • 29417504 that • 28542378 not • 28162417 an • 28110383 as • 28076530 home • 27650474 it • 27572533 i • 24548796 have • 24420453 if • 24376758 new • 24171603 t • 23951805 your • 23875218 page • 22292805 about • 22265579 com • 22107392 information
Source: http://elib.cs.berkeley.edu/docfreq/index.html
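Rank-frequency lists like this one are the raw material for the Zipf distributions named in the course topics: frequency falls off roughly as 1/rank, so rank times frequency stays roughly constant. A minimal sketch using ideal, synthetic frequencies (not the counts above):

```python
def zipf_frequencies(c: float, n: int) -> list:
    """Ideal Zipf distribution: the rank-r term has frequency c / r."""
    return [c / r for r in range(1, n + 1)]

freqs = zipf_frequencies(c=1_000_000.0, n=10)

# Under ideal Zipf, rank * frequency is the same constant at every rank
products = [rank * f for rank, f in enumerate(freqs, start=1)]
```

Real term counts only follow this law approximately, and usually over a wide middle range of ranks rather than at the very top.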
Slide 15
Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid
Slide 16
Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
– 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
– 6.7% Schedule of classes or final exams (6222)
– 5.4% Summer Session (5041)
– 3.2% Extension (2932)
– 3.1% Academic Calendar (2846)
– 2.4% Directories (2202)
– 1.7% Career Center (1588)
– 1.7% Housing (1583)
– 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
Slide 17
Queries as Zeitgeist
From: http://www.google.com/press/zeitgeist.html
Slide 18
How DO people search?
• Different approaches for different tasks
• Models of the search process attempt to summarize how people interact with information resources when seeking information
– Standard IR model
– Alternative models
Slide 19
The Standard Retrieval Interaction Model
Slide 20
Standard Model of IR
• Assumptions:
– The goal is maximizing precision and recall simultaneously
– The information need remains static
– The value is in the resulting document set
Slide 21
Problems with Standard Model
• Users learn during the search process:
– Scanning titles of retrieved documents
– Reading retrieved documents
– Viewing lists of related topics/thesaurus terms
– Navigating hyperlinks
• Some users don’t like long and (apparently) disorganized lists of documents
Slide 22
IR is an Iterative Process
[Diagram: iteration among Goals/Needs, a Workspace, and Repositories/Resources]
Slide 23
IR is a Dialog
• The exchange doesn’t end with the first answer
• Users can recognize elements of a useful answer, even when incomplete
• Questions and understanding change as the process continues
Slide 24
Bates’ “Berry-Picking” Model
• Standard IR model
– Assumes the information need remains the same throughout the search process
• Berry-picking model
– Interesting information is scattered like berries among bushes
– The query is continually shifting
Slide 25
Berry-Picking Model
[Diagram: a searcher’s shifting path through queries Q0, Q1, Q2, Q3, Q4, Q5]
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
Slide 26
Berry-Picking Model (cont.)
• The query is continually shifting
• New information may yield new ideas and new directions
• The information need
– Is not satisfied by a single, final retrieved set
– Is satisfied by a series of selections and bits of information found along the way
Slide 27
Information Seeking Behavior
• Two parts of a process:
– Search and retrieval
– Analysis and synthesis of search results
• This is a fuzzy area
– We will look briefly at some different working theories
Slide 28
Search Tactics and Strategies
• Search Tactics– Bates 1979
• Search Strategies– Bates 1989– O’Day and Jeffries 1993
Slide 29
Tactics vs. Strategies
• Tactic: short-term goals and maneuvers
– Operators, actions
• Strategy: overall planning
– Link a sequence of operators together to achieve some end
Slide 30
Information Search Tactics
• Monitoring tactics
– Keep search on track
• Source-level tactics
– Navigate to and within sources
• Term and Search Formulation tactics
– Designing search formulation
– Selection and revision of specific terms within search formulation
Slide 31
Monitoring Tactics (Strategy-Level)
• Check
– Compare original goal with current state
• Weigh
– Make a cost/benefit analysis of current or anticipated actions
• Pattern
– Recognize common strategies
• Correct Errors
• Record
– Keep track of (incomplete) paths
Slide 32
Source-Level Tactics
• “Bibble”:– Look for a pre-defined result set
• E.g., a good link page on the web
• Survey:– Look ahead, review available options
• E.g., don’t simply use the first term or first source that comes to mind
• Cut:– Eliminate large proportion of search domain
• E.g., search on rarest term first
Slide 33
Search Formulation Tactics
• Specify
– Use terms that are as specific as possible
• Exhaust
– Use all possible elements in a query
• Reduce
– Subtract elements from a query
• Parallel
– Use synonyms and parallel terms
• Pinpoint
– Reduce parallel terms, refocusing the query
• Block
– Reject or block some terms, even at the cost of losing some relevant documents
Slide 34
Term Tactics
• Move around a thesaurus
– Superordinate, subordinate, coordinate
– Neighbor (semantic or alphabetic)
– Trace – pull out terms from information already seen as part of the search (titles, etc.)
– Morphological and other spelling variants
– Antonyms (contrary)
Slide 35
Additional Considerations (Bates 79)
• More detail is needed about short-term cost/benefit decision rule strategies
• When to stop?
– How to judge when enough information has been gathered?
– How to decide when to give up an unsuccessful search?
– When to stop searching in one source and move to another?
Slide 36
Implications
• Search interfaces should make it easy to store intermediate results
• Interfaces should make it easy to follow trails with unanticipated results (and find your way back)
• This all makes evaluation of the search, the interface and the search process more difficult
Slide 37
More Later…
• Later in the course:
– More on Search Process and Strategies
– User interfaces to improve IR process
– Incorporation of Content Analysis into better systems
Slide 38
Restricted Form of the IR Problem
• The system has available only pre-existing, “canned” text passages
• Its response is limited to selecting from these passages and presenting them to the user
• It must select, say, 10 or 20 passages out of millions or billions!
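Selecting the 10 or 20 best passages out of millions is a top-k selection problem; a heap-based selection avoids sorting the whole collection. A sketch with randomly generated, hypothetical scores:

```python
import heapq
import random

random.seed(202)

# Hypothetical relevance scores for 100,000 "canned" passages
scores = [(doc_id, random.random()) for doc_id in range(100_000)]

# Keep only the 10 best-scoring passages, highest first,
# without fully sorting all 100,000 entries
top10 = heapq.nlargest(10, scores, key=lambda pair: pair[1])
```

Real systems add index structures so that most passages are never scored at all, but the final step is still selection of this form.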
Slide 39
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries
• This set of assumptions underlies the field of Information Retrieval
Slide 40
Relevance (Introduction)
• In what ways can a document be relevant to a query?
– Answer a precise question precisely
• Who is buried in Grant’s tomb? Grant? Or no one?
– Partially answer the question
• Where is Danville? Near Walnut Creek.
• Where is Dublin?
– Suggest a source for more information
• What is lymphedema? Look in this medical dictionary.
– Give background information
– Remind the user of other knowledge
– Others...
Slide 41
Relevance
• “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’know’ concept, as is information, for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.”
» Saracevic, 1975, p. 324
Slide 42
Define your own relevance
• Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor
• Where…
From Saracevic, 1975 and Schamber 1990
Slide 43
A. Gages
• Measure
• Degree
• Extent
• Judgement
• Estimate
• Appraisal
• Relation
Slide 44
B. Aspect
• Utility
• Matching
• Informativeness
• Satisfaction
• Appropriateness
• Usefulness
• Correspondence
Slide 45
C. Object judged
• Document
• Document representation
• Reference
• Textual form
• Information provided
• Fact
• Article
Slide 46
D. Frame of reference
• Question
• Question representation
• Research stage
• Information need
• Information used
• Point of view
• Request
Slide 47
E. Assessor
• Requester
• Intermediary
• Expert
• User
• Person
• Judge
• Information specialist
Slide 48
Lecture Overview
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
• Discussion
Slide 49
Visions of IR Systems
• Rev. John Wilkins, 1600’s: The Philosophic Language and tables
• Wilhelm Ostwald and Paul Otlet, 1910’s: The “monographic principle” and Universal Classification
• Emanuel Goldberg, 1920’s - 1940’s
• H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937)
• Vannevar Bush, “As We May Think.” Atlantic Monthly, 1945.
Slide 50
Card-Based IR Systems
• Uniterm (Casey, Perry, Berry, Kent: 1958)
– Developed and used from mid-1940’s
[Images of Uniterm cards for the terms EXCURSION and LUNAR, each listing posted document numbers in columns]
Slide 51
Card Systems
• Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948
[Diagram: optical coincidence cards for the terms “Lunar” and “Excursion”]
Slide 52
Card Systems
• Zatocode (edge-notched cards) Mooers, 1951
[Diagram: three edge-notched document cards with dummy title/author/abstract text; Documents 200 and 34 contain the term “Lunar”]
Slide 53
Computer-Based Systems
• Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
– Due to the need to move and shift the text in core memory while carrying out the comparisons
• 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
Slide 54
Historical Milestones in IR Research
• 1958 Statistical Language Properties (Luhn)
• 1960 Probabilistic Indexing (Maron & Kuhns)
• 1961 Term association and clustering (Doyle)
• 1965 Vector Space Model (Salton)
• 1968 Query expansion (Rocchio, Salton)
• 1972 Statistical Weighting (Sparck Jones)
• 1975 2-Poisson Model (Harter, Bookstein, Swanson)
• 1976 Relevance Weighting (Robertson, Sparck Jones)
• 1980 Fuzzy sets (Bookstein)
• 1981 Probability without training (Croft)
Slide 55
Historical Milestones in IR Research (cont.)
• 1983 Linear Regression (Fox)
• 1983 Probabilistic Dependence (Salton, Yu)
• 1985 Generalized Vector Space Model (Wong, Raghavan)
• 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
• 1990 Latent Semantic Indexing (Dumais, Deerwester)
• 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
• 1992 TREC (Harman)
• 1992 Inference networks (Turtle, Croft)
• 1994 Neural networks (Kwok)
Slide 56
Boolean IR Systems
• Synthex at SDC, 1960
• Project MAC at MIT, 1963 (interactive)
• BOLD at SDC, 1964 (Harold Borko)
• 1964 New York World’s Fair – Becker and Hayes produced a system to answer questions (based on airline reservation equipment)
• SDC began production for a commercial service in 1967 – ORBIT
• NASA-RECON (1966) becomes DIALOG
• 1972 Data Central/Mead introduced LEXIS – full text
• Online catalogs – late 1970’s and 1980’s
Slide 57
The Internet and the WWW
• Gopher, Archie, Veronica, WAIS
• Tim Berners-Lee, 1991, creates WWW at CERN – originally hypertext only
• WebCrawler
• Lycos
• AltaVista
• Inktomi
• Google
• (and many others)
Slide 58
Information Retrieval – Historical View

Research:
• Boolean model, statistics of language (1950’s)
• Vector space model, probabilistic indexing, relevance feedback (1960’s)
• Probabilistic querying (1970’s)
• Fuzzy set/logic, evidential reasoning (1980’s)
• Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)

Industry:
• DIALOG, LexisNexis, STAIRS (Boolean based)
• Information industry (O($B))
• Verity TOPIC (fuzzy logic)
• Internet search engines (O($100B??)) (vector space, probabilistic)
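Of the research-side models listed, the vector space model is the easiest to sketch: documents and the query become term-count vectors, and results are ranked by cosine similarity. A toy example with hypothetical texts:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = Counter("lunar module".split())
doc1 = Counter("lunar excursion lunar module".split())
doc2 = Counter("excursion train schedule".split())

# doc1 shares terms with the query; doc2 shares none
scores = {"doc1": cosine(query, doc1), "doc2": cosine(query, doc2)}
```

Production systems weight the counts (e.g., the tf.idf statistical weighting credited to Sparck Jones above) rather than using raw frequencies.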
Slide 59
Lecture Overview
• Introduction to Information Retrieval
• The Information Seeking Process
• Information Retrieval History and Developments
• Discussion
Slide 60
Mano Marks on MIR
• The authors make a distinction between data retrieval and information retrieval. What is that distinction? When would data retrieval be more appropriate than information retrieval?
• When would information retrieval be more appropriate?
• In this context, what is data? What is information?
Slide 61
Melissa Chan on Bates
• Bates published this berry-picking article in 1989, stating that real-life queries tend to shift and evolve as a user retrieves information. How do Bates’ search strategies of footnote chasing, citation searching, journal run, area scanning, subject searches, and author searches parallel a research search on the Internet/online libraries today? Which methods do you more frequently use?
• Online libraries 15 years later... Would you need to redesign Berkeley’s online library to fit the search methods listed by Bates? Does the current design limit or expand your ability to "berry pick" among the library collections? See http://melvyl.cdlib.org/
Slide 62
Irina Lib on Berlin
• The authors of TeamInfo put a lot of effort into organizing information into categories to minimize searching. With Google advocating the "search, not sort" approach to e-mail, do you think this approach works for a group memory system? Do you think it works well for individual systems?
• TeamInfo was tested on a relatively small, homogeneous group of people. Do you think a system such as TeamInfo would work well for larger, more heterogeneous groups? What problems, if any, would arise?
Slide 63
Jen King on Munro
• What are the possible flaws with using social navigation (“navigation towards a cluster of people or navigation because other people have looked at something”) as a theoretical framework for design? One suggestion: if we base a design upon how an aggregate of people appear to use something, we will inevitably exclude some portion of the audience who doesn’t conform to the norm (Amazon.com recommendations are a possible example of this phenomenon).
• Non-verbal cues are an important element of human communication. Could social navigation supply the kind of contextual cues that non-verbal communication provides, helping individuals comprehend information?
Slide 64
Jen King on Munro
• The central point of social navigation made in the reading is a shift from thinking about computers as external objects humans act upon to a ubiquitous computing environment where humans are engaged with computers in many contexts, both individually and as part of a social group. The authors note that an alternate design possibility includes a “move away from ‘dead’ information spaces we see on the Internet today and in every way possible open up the spaces for seeing other users — both directly and indirectly” (p. 6), or in other words, creating a “virtual reality” where the presence of other people (and not merely unidirectional web pages) defines the environment. Have you encountered any computerized social environments that you thought worked well? If not, how did they fail? Do you agree that interacting directly with other users online is the future of information spaces?
Slide 65
Next Time
• Boolean Queries and Text Processing
• Readings (note – slight rearrangement of the web site and readings)
– (Background) MIR Ch. 2 and Ch. 4
– How to Use Controlled Vocabularies More Effectively in Online Searching (Bates)
– Improving Full-Text Precision on Short Queries using Simple Constraints (Hearst)