TRANSCRIPT
INFM 700: Session 9
Search (Part II): Search Engines in Information Architecture
Paul Jacobs, The iSchool, University of Maryland
Wednesday, Apr. 18, 2012
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Today's Topics
- Very short recap
- Fundamentals of information retrieval
- Search engines in practice (web search and web sites)
- Issues and tricks: stemming/word issues, query formulation/expansion/assistance, tagging/structuring, others
- Deploying search – what we get to do, and how
Vector Space Model
Assumption: Documents that are “close together” in vector space “talk about” the same things
[Figure: documents d1 through d5 plotted as vectors in a space whose axes are terms t1, t2, t3; θ and φ mark the angles between vectors]
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
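The "closeness" idea is most often implemented as cosine similarity: the cosine of the angle between the query vector and each document vector. A minimal sketch in Python (the example vectors and weights are invented for illustration):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-weight vectors, given as
    {term: weight} dicts. 1.0 = same direction, 0.0 = no shared terms."""
    # Dot product over the terms the two vectors share
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical query and documents
query = {"search": 1.0, "engine": 1.0}
doc1 = {"search": 2.0, "engine": 1.0, "web": 1.0}
doc2 = {"cooking": 3.0, "recipes": 2.0}
```

Ranking is then just sorting documents by their similarity to the query; doc1 scores high here because its vector points in nearly the same direction as the query's, while doc2 shares no terms and scores zero.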
Term Weighting
Term weights consist of two components:
- Local: how important is the term in this document?
- Global: how important is the term in the collection?
Here's the intuition:
- Terms that appear often in a document should get high weights
- Terms that appear in many documents should get low weights
How do we capture this mathematically?
- Term frequency (local)
- Inverse document frequency (global)
TF.IDF Term Weighting

w_{i,j} = tf_{i,j} × log(N / n_i)

where:
- w_{i,j}: weight assigned to term i in document j
- tf_{i,j}: number of occurrences of term i in document j
- N: number of documents in the entire collection
- n_i: number of documents containing term i
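The formula translates directly into code; a minimal sketch (the example counts below are hypothetical):

```python
import math

def tfidf_weight(tf_ij, N, n_i):
    """w_{i,j} = tf_{i,j} * log(N / n_i): raw term frequency scaled by
    inverse document frequency. A term that appears in every document
    (n_i == N) gets weight 0, no matter how frequent it is locally."""
    return tf_ij * math.log(N / n_i)

# Hypothetical collection of 1000 documents:
# a term occurring 3 times in a document, but found in only 10 documents,
# far outweighs one found in 500 documents.
rare_term = tfidf_weight(3, 1000, 10)
common_term = tfidf_weight(3, 1000, 500)
```

This captures both intuitions from the previous slide: local frequency raises the weight, collection-wide frequency lowers it.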
Summary thus far...
- Represent documents (and queries) as "bags of words" (terms)
- Derive term weights based on frequency
- Use weighted term vectors for each document and query
- Compute a vector-based similarity score
- Display sorted, ranked results
Issues and Tricks
- What's a word/term? We can ignore words ("stop words"), combine words (phrases), or split words up ("stemming")
- Other special treatment (e.g., names, categories)
- Query formulation/suggestion
- Type of information need
- Popularity: based on link analysis/PageRank, click-through, other signals
- Structuring and tagging (e.g., "best bets")
Issues and Tricks (cont'd)
- Thesaurus/query expansion: based on meaning and conceptual relationships, or on decomposition/type
- User feedback/"more like this"
- Clustering/grouping of results
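Thesaurus-based query expansion can be as simple as appending known synonyms to each query term before searching. A toy sketch (the thesaurus entries are invented; a real deployment would use a curated vocabulary or a resource like WordNet):

```python
# Invented example entries; a production thesaurus would be much larger
# and usually maintained by hand for the site's domain.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician"],
}

def expand_query(terms):
    """Return the original query terms plus any thesaurus synonyms,
    deduplicated, with the original terms kept first."""
    expanded = list(terms)
    for term in terms:
        for synonym in THESAURUS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded
```

The expanded term list is then run through the normal retrieval pipeline; this trades some precision for recall, which is why expansion is often offered as a suggestion rather than applied silently.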
Morphological Variation
Handling morphology: related concepts have different forms
- Inflectional morphology: same part of speech
- Derivational morphology: different parts of speech
Different morphological processes: prefixing, suffixing, infixing, reduplication
Examples:
- dogs = dog + PLURAL
- broke = break + PAST
- destruction = destroy + ion
- researcher = research + er
Stemming
Dealing with morphological variation: index stems instead of words
Stem: a word equivalence class that preserves the central concept
How much to stem?
- organization → organize → organ?
- resubmission → resubmit/submission → submit?
- reconstructionism?
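To make the idea concrete, here is a toy suffix-stripping stemmer in the spirit of (but far simpler than) Porter's algorithm; the suffix list and minimum-stem-length rule are illustrative assumptions, not a real stemmer:

```python
def toy_stem(word):
    """Strip the first matching suffix, longest candidates first,
    but only if at least 3 characters of stem would remain.
    Real stemmers have many more rules, orderings, and exceptions."""
    for suffix in ("ational", "ization", "ations", "ing", "ion", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Even this crude version maps "dogs" and "researcher" to usable stems, but it also shows why stemming is tricky: "running" becomes "runn", and deciding how far to strip ("organization" vs. "organ") is exactly the judgment call on the slide above.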
Does Stemming Work?
Generally, yes! (in English)
- Helps more for longer queries and when there are fewer results
- Lots of work done in this area
- But used very sparingly in web search – why?

Donna Harman (1991). How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz (1993). Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull (1996). Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others...
Beyond Words...
Stemming/tokenization is a specific instance of a general problem: what is the unit of indexing?
Other units of indexing:
- Concepts (e.g., from WordNet)
- Named entities
- Relations
- ...
Some Observations
- Search engine fundamentals are very similar across systems
- There are many tricks and differences beyond the basic model
- These differences play out differently, and are magnified, as we get to specific sites and applications
- So, as we get to deployment: be skeptical, test rigorously; some small things can make a big difference
Deployment - Overview
- What we can control
- Basic process of setting up and using search in IA
- Key parameters/issues: what to search/organizing content, testing and improving results, presentation/interfaces
What do we control (the IA part)?
- Requirements and search engine selection: developing search requirements, build vs. buy, vendor evaluation/selection, consultants?
- Content selection: what to search, zones, etc.; tags
- Search engine configuration: zones, what gets indexed (and sometimes how); number of results; sometimes recall vs. precision; others (very often interface-related)
- Interfaces
Search Engine Selection
Commercial examples:
- Autonomy (including the former Verity, Ultraseek, ...)
- Google (site search, search appliance)
- Thunderstone
Build your own, open source?
- Lucene
Defining requirements:
- Basic search: how big, type of documents, what sort of interface, metadata, parametric?
- Advanced requirements: automatic tagging, alerts, "more like this"
- Customization and improvement using logs
- Keep it focused?
Search Engine Selection (cont'd)
Pitfalls to avoid:
- "Getting a bargain"
- Getting it "free"
- Great sales reps
Good ideas:
- Get case studies, talk to references
- Get a "proof of concept" period
Issues and Tricks
Deploying Search
iSchool
Simple Requirements Matrix
Vendor Name: ____
Requirement/Criterion (Priority; Rating and Comments to be filled in per vendor)

1. Identify Early Warnings/Search
  1.a. Highly detailed information needs (Priority 1)
  1.b. Date range restrictions (Priority 1)
  1.c. Company name restrictions (Priority 1)
  1.d. Alias/equivalence (e.g., The Walt Disney Company = Disney) (Priority 2)
  1.e. Ability to assign unique IDs (e.g., Disney = NYSE:DIS) (Priority 2)
  1.f. Restrict/search by subject area/topic (Priority 2)
  1.g. Ability to partition/segment articles with multiple topics (Priority 3)
  1.h. Federated search w/web content, Nexis, etc. (Priority 2)
  1.i. Use of extended lists (e.g., lists of companies, subjects) (Priority 2)
2. Identify Early Warnings/Alerts
  2.a. Highly detailed information needs (all of i-h above) (Priority 1)
  2.b. Controlling/weighting specific elements (Priority 2)
  2.c. Recall/precision tradeoff (Priority 3)
  2.d. Identify "new" and "hot" articles that match user's interest (Priority 2)
  2.e. Sentiment analysis component (Priority 3)
3. Identify Early Warnings/Discovery
  3.a. Classify documents in pre-defined or user-defined categories (Priority 3)
  3.b. Document clustering (Priority 3)
  3.c. Identification of trends/issues (Priority 2)
  3.d. Other discovery tools (Priority 3)
4. Integration and interface requirements
Content Selection (What to Search)
Generally, search everything, but...
- Be leery about providing a "search the web" option
- Use zones or separate text databases for frequent/infrequent information needs
- Be careful about outdated/deleted content
- Make sure "best bets" come to the top
- Use logs, test & improve
Testing and Improvement
- Keep track of queries (and results, if possible) using logs
- If logs are not available, try user experiments; if results are not available, get them
- Collect relevance/correctness judgments; quantitative scores (e.g., recall/precision) are useful, too
How to improve:
- Focus on the most frequent (important?) requests (90-10 or 80-20)
- "Best bets"
- Content manipulation (e.g., adding tags)
- Thesaurus
- Keep testing
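Finding the "most frequent requests" starts with simple log analysis. A sketch, assuming one raw query string per log line (the sample log is invented):

```python
from collections import Counter

def top_queries(log_lines, k=3):
    """Normalize query strings (trim, lowercase) from a search log and
    return the k most frequent: the 90-10 candidates to improve first."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return counts.most_common(k)

# Hypothetical query log; note the casing variants that normalization merges.
log = ["parking permit", "Parking Permit", "library hours",
       "parking permit", "transcript request"]
```

Even this much tells you where to spend effort: if a handful of queries dominate the log, a "best bets" entry or a navigation link for each of them improves most users' experience at once.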
"Best Bets" – How to Implement
1. Identify the desired result page
2. Determine possible query strings (from logs)
3. Tag metadata in documents with the query string
4. Configure the search interface (e.g., to show Best Bets first; what to do about multiple Best Bets)
This is a special case of using a tag field (e.g., keywords, categories, description)
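The steps above amount to a lookup table consulted before the ranked results are shown. A minimal sketch (the query-to-URL mappings are hypothetical):

```python
# Editorially maintained mapping from normalized query strings,
# harvested from the logs, to hand-picked result pages.
BEST_BETS = {
    "parking": "/services/parking-permits",
    "parking permit": "/services/parking-permits",
    "library hours": "/library/hours",
}

def search_with_best_bets(query, ranked_results):
    """Prepend the editorially chosen 'best bet' page, if one is mapped
    to this query, ahead of the engine's ranked results."""
    best = BEST_BETS.get(query.strip().lower())
    if best is None:
        return ranked_results
    # Avoid showing the same page twice if the engine also ranked it.
    return [best] + [r for r in ranked_results if r != best]
```

In practice the mapping usually lives in document metadata (the tag field mentioned above) rather than in code, but the effect at query time is the same.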
Designing a Search Interface
- The box (size, position, labels)
- Content selection (defaults, radio buttons or pull-down selection)
- Parameters or advanced search (Booleans, separate zones, other possibilities)
Designing a Search Interface - Results
- Number of results to display
- Recall/precision tradeoff?
- Snippet/summary information for each hit
- Layout of best bets/other hits
- Repetition of the query
- "No results": other possible tips
- Iteration and refinement
- Other (e.g., scores, clusters, ...)
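Snippet generation is often just keyword-in-context extraction: show a few words around the first query-term match, with ellipses where the text is cut. A rough sketch (window size and punctuation handling are simplifying assumptions):

```python
def make_snippet(text, query_terms, window=8):
    """Return a short keyword-in-context snippet around the first
    query-term match; fall back to the document's opening words."""
    words = text.split()
    lowered = [w.lower().strip(".,;:!?") for w in words]
    terms = {t.lower() for t in query_terms}
    for i, w in enumerate(lowered):
        if w in terms:
            start = max(0, i - window // 2)
            end = min(len(words), i + window // 2 + 1)
            snippet = " ".join(words[start:end])
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + snippet + suffix
    # No match found: show the beginning of the document instead.
    return " ".join(words[:window]) + (" ..." if len(words) > window else "")
```

Real engines refine this by preferring windows containing several query terms and by bolding the matched terms, but the basic mechanism is the same.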
Some example sites
www.hp.com
www.dell.com
www.ecoearth.info
www.washingtonpost.com
www.dailygazette.com
www.friendsofrockcreek.org
www.cbf.org
www.umd.edu
Integrating Search and Browsing
- Provide more navigation for common needs, based on search logs and other info
- Redirect from search results to navigation
- Faceted browsing
- ...
Faceted Browsing Example
Demo: http://flamenco.berkeley.edu/demos.html
Advantages of Facets
- Integrates searching and browsing
- Easy to build complex queries
- Easy to narrow, broaden, or shift focus
- Helps users avoid getting lost
- Helps to prevent "categorization wars"
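Under the hood, faceted browsing is conjunctive filtering over item metadata, plus per-facet counts for the UI. A small sketch with invented items and facets:

```python
# Hypothetical catalog; each item carries values for several facets.
ITEMS = [
    {"title": "Red running shoe", "color": "red", "type": "shoe"},
    {"title": "Blue running shoe", "color": "blue", "type": "shoe"},
    {"title": "Red jacket", "color": "red", "type": "jacket"},
]

def facet_filter(items, selections):
    """Keep items matching every selected facet value. Narrowing is just
    adding a selection; broadening or shifting focus is removing one."""
    return [it for it in items
            if all(it.get(facet) == value for facet, value in selections.items())]

def facet_counts(items, facet):
    """Counts shown next to each facet value in the interface."""
    counts = {}
    for it in items:
        v = it.get(facet)
        counts[v] = counts.get(v, 0) + 1
    return counts
```

Because each click adds or removes one constraint, users build complex conjunctive queries without ever typing Boolean syntax, which is exactly why facets integrate searching and browsing so well.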
Recap
- Search is an IA issue!
- Quality of search results and user experience depends on: understanding how search engines work, choosing and deploying carefully, constant testing and improvement, and time
- Tremendous range of parameter/interface choices
- Integrating search and browsing/navigation is a very good idea