Information Retrieval (1955-1992)
• Primary users
– Law clerks
– Reference librarians
– (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary search models
– Boolean keyword searches on abstract, title, keywords
• Vendors
– Mead Data Central (Lexis-Nexis)
– Dialog
– Westlaw
• Total searchable online data: O(10 terabytes)
Information Retrieval (1993+)
• Primary users
– First-time computer users
– Novices
• Primary search modes
– Still Boolean keyword searches, with limited probabilistic models
– But FULL TEXT retrieval
• Vendors
– Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???
Growth of the Web
[Figure: # of web sites (or volume of web traffic) vs. year, 1992-1998, showing exponential growth; the Mosaic and Netscape releases are marked. Volume doubling every 6 months.]
Observation
• Early IR systems basically extended library catalog systems, allowing
– keyword searches,
– limited abstract searches
in addition to author/title/subject search, including Boolean combination functionality.
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand).
In Contrast
Today, IR has a much wider role in the age of digital libraries:
• Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
• Question answering
Old View
Function of IR: map queries (… AND … OR …) to relevant documents.

New View
Satisfy the user's information need. Infer goals/information need from:
- the query itself
- past user query history
- user profiling (aol.com vs. CS dept.)
- collective analysis of other user feedback on similar queries
In addition, return information in a format useful/intelligible to the user:
• weighted orderings
• clusterings of documents by different attributes
• visualization tools
• text understanding techniques to extract the answer to a question, or at least the relevant subregion of the text

"Who is the current mayor of Columbus, Ohio?"
The user doesn't need a full AP/CNN article on city scandals, just the answer (and an available source for proof).
Boolean Systems
Function #1: Provide a fast, compact index into the database (of documents or references).

Data structure: inverted file, e.g. posting lists for "Chihuahua" and "Nanny".
Index granularity options:
- doc number
- page number in doc
- actual word offset

Boolean operations:
- Chihuahua AND Nanny → join (intersection)
- Chihuahua OR Nanny → union
- Proximity searches: Chihuahua W/3 Nanny (within 3 words)
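The inverted file and the operations above can be sketched in a few lines. This is a toy illustration (the documents and their contents are made up), indexing at word-offset granularity, the finest option listed, so that proximity (W/k) searches work too:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted file: word -> {doc_id: [word offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def boolean_and(index, w1, w2):
    """w1 AND w2: join (intersection) of the posting lists."""
    return set(index[w1]) & set(index[w2])

def boolean_or(index, w1, w2):
    """w1 OR w2: union of the posting lists."""
    return set(index[w1]) | set(index[w2])

def within(index, w1, w2, k):
    """Proximity search w1 W/k w2: w1 within k words of w2."""
    return {d for d in boolean_and(index, w1, w2)
            if any(abs(p1 - p2) <= k
                   for p1 in index[w1][d] for p2 in index[w2][d])}

docs = {1: "chihuahua nanny wanted",
        2: "nanny available weekends",
        3: "chihuahua breeding club"}
index = build_index(docs)
```

With this index, `boolean_and(index, "chihuahua", "nanny")` returns only document 1, while the OR query returns all three.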
Vector IR Model
[Diagram: documents d1 and d2 are mapped by a function f( ) to vectors V1 and V2; the query Q is mapped the same way to VQ.]
Find an optimal f( ) such that
Sim(Vi, VQ) ≈ Sim'(Di, Q) and Sim(V1, V2) ≈ Sim'(d1, d2).
A common choice for Sim is cosine distance.
Vector Models
[Diagram: documents D1 and D2 are mapped to vectors V1 and V2 (a bit vector capturing the essence/meaning of D1), and the query to Q1. Retrieval: find the document maximizing Sim(Vi, Q1), e.g. Sim(V1, Q1).]
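A minimal sketch of the vector model with cosine similarity, taking raw term counts as the mapping f( ) (an assumption; the slides leave f( ) open). The documents and query are invented for illustration:

```python
import math
from collections import Counter

def vectorize(text):
    """f( ): map a document to a raw term-count vector."""
    return Counter(text.lower().split())

def cosine_sim(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v2[t] for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = ["japan economy news",
        "chihuahua breeding club",
        "the economy of japan and china"]
vecs = [vectorize(d) for d in docs]
q = vectorize("japan economy")

# Retrieval: the document whose vector is most similar to the query vector
best = max(range(len(docs)), key=lambda i: cosine_sim(vecs[i], q))
```

Note that the first and third documents both contain the query terms, but cosine normalization favors the shorter, more focused first document.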
Dimensionality Reduction (SVD/LSI)
[Diagram: document d1 → f( ) → V1 (initial term-vector representation) → V̂1 (a more compact, reduced-dimensionality model of d1).]
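One concrete way to perform the SVD/LSI reduction, sketched with NumPy on a made-up term-document matrix (the choice of k = 2 and the matrix entries are arbitrary illustrations):

```python
import numpy as np

def lsi_reduce(term_doc, k):
    """SVD-based reduction: each column (document) becomes a k-dim vector."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the top-k singular directions: V-hat = S_k @ Vt_k
    return np.diag(s[:k]) @ Vt[:k, :]

# Made-up 4-term x 3-document matrix (rows: terms, columns: documents)
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

V_hat = lsi_reduce(A, k=2)   # shape (2, 3): one reduced vector per document
```

Queries are folded into the same reduced space, so similarity comparisons operate on short dense vectors instead of the full term vocabulary.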
Clustering Words
Index offset options for a word w: K = hash(w), hash(cluster(w)), or hash(cluster(stem(w))).

Example cluster: {Japan, Nippon, Nihon, Japanese, …} → Japan
Example stems: books → book; computer → comput; computation → comput

[Diagram: in the raw term vector V1 for D1, the individual counts for "The" (192) and the Japan-cluster words (Japanese 3, Japan 1, …) appear separately; in the condensed vector the cluster collapses to a single entry "Japan* 5".]
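A toy sketch of condensing a raw term vector via clustering and stemming. The cluster table and suffix-stripping rules below are simplified assumptions chosen to reproduce the slide's examples; they are not a real stemmer:

```python
# Hypothetical stand-ins for cluster(w) and stem(w)
CLUSTERS = {"japan": "japan", "nippon": "japan",
            "nihon": "japan", "japanese": "japan"}

def stem(word):
    """Crude suffix stripping: books -> book, computation -> comput."""
    for suffix in ("ation", "ese", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def condense(raw_counts):
    """Collapse a raw term vector: map each word to its cluster if
    known, otherwise fall back to its stem, and sum the counts."""
    condensed = {}
    for word, count in raw_counts.items():
        w = word.lower()
        key = CLUSTERS.get(w, stem(w))
        condensed[key] = condensed.get(key, 0) + count
    return condensed

# Raw term vector entries from the slide's Japan example
raw = {"Japan": 3, "Nippon": 1, "Japanese": 1}
```

Here `condense(raw)` merges all three surface forms into one "japan" entry with count 5, matching the condensed vector on the slide.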
Collocation (Phrasal Terms)
[Diagram: over the dimensions (Soap, Opera, "Soap opera"):
d1 "The soap opera" → 0 0 1
d2 "The soap residue" / "an opera by Verdi" → 1 1 0
Treating the collocation "soap opera" as a single phrasal term distinguishes d1, which is about a soap opera, from d2, which merely mentions soap and an opera.]
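Phrasal-term handling can be sketched as a preprocessing pass that rewrites known collocations into single tokens before vectorization (the phrase list here is a stand-in for a real collocation dictionary):

```python
def phrasal_terms(text, phrases=frozenset({("soap", "opera")})):
    """Rewrite known two-word collocations as single phrasal tokens."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            out.append(words[i] + "_" + words[i + 1])  # one phrasal term
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

After this pass, "The soap opera" and "The soap residue" no longer share the token that matters for a query about soap operas.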
A Vector, Abstractly, Is a Compressed Document (Meaning-Preserving)
document d1 → f(d1) = m1
document d2 → f(d2) = m2
• Compression: m1 = m2 iff d1 = d2; f( ) must be invertible.
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing): a "meaning" or "context" vector representation.

What is the optimal method for meaning-preserving compression?
Issues
• Size of representation (ideally size(Vi) << size(Di))
• Cost of computing the vectors
– one-time cost at model creation
• Cost of the similarity function
– must be computed for each query
– crucial to speed that this be minimized
• Header processing: retain/model cross references (e.g. fold the referenced documents' vectors into Vi, as in Vi = V1 + ref2(V2) + ref3(V3))
• Function words: 1. remove (most) function words, or 2. downweight them by frequency, or 3. use text analysis to decide which function words carry meaning (e.g. NOT).
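Downweighting by frequency (option 2 above) is commonly realized as tf-idf, sketched here on invented documents. A function word like "the", which appears in every document, gets a weight of exactly zero:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by tf * log(N / df): terms that occur in many
    documents are downweighted, terms in few documents are boosted."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency of each term
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = ["the chihuahua barked", "the nanny arrived", "the opera began"]
vecs = tf_idf(docs)
# "the" occurs in all 3 documents, so its idf is log(3/3) = 0
```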
Supervised Learning/Training
[Diagram: an input data stream feeds one trained recognizer per class:
A - Project #1
B - Chihuahua Breeding Club
C - Personal
J - Junk mail
Collective discrimination among the recognizers produces labelled (routed) output, e.g. B A C J, in real time (ongoing); the labelled output also serves as training data.]

Other related problems: mail/news routing and filtering.
[Diagram: a data stream is routed into prioritized inboxes: Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail.]
These systems typically model long-term information needs. (People put effort into training and user feedback that they aren't willing to invest for single-query-based IR.)
Features for classification (regions weighted differently):
• Subject line
• Source/sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content
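A minimal sketch of a recognizer trained on such features, here a naive Bayes classifier with add-one smoothing. The feature extraction, the doubling of subject-line words to weight that region more heavily, and the sample messages are all illustrative assumptions, not the course's actual design:

```python
import math
from collections import Counter, defaultdict

def extract_features(msg):
    """Features from an email dict: sender, subject line, and body.
    Subject words are repeated to weight that region more heavily."""
    feats = ["sender:" + msg["sender"]]
    feats += ["subj:" + w for w in msg["subject"].lower().split()] * 2
    feats += msg["body"].lower().split()
    return feats

def train(labeled):
    """Count features per class label."""
    counts = defaultdict(Counter)
    for msg, label in labeled:
        counts[label].update(extract_features(msg))
    return counts

def classify(counts, msg):
    """Naive Bayes with add-one smoothing: return the best-scoring label."""
    feats = extract_features(msg)
    def score(label):
        c = counts[label]
        total = sum(c.values())
        return sum(math.log((c[f] + 1) / (total + len(c) + 1))
                   for f in feats)
    return max(counts, key=score)

labeled = [
    ({"sender": "club@chi.org", "subject": "chihuahua news",
      "body": "breeding tips"}, "B"),
    ({"sender": "spam@x.com", "subject": "win money",
      "body": "free money now"}, "J"),
]
counts = train(labeled)
msg = {"sender": "club@chi.org", "subject": "chihuahua meetup",
       "body": "club event"}
```

A new message from the club address with "chihuahua" in the subject routes to class B; a money-themed message routes to J.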
Probabilistic IR Models: Intermediate Topic Models/Detectors
[Diagram: documents d1, d2 and the query Q are mapped by f( ) through a bank of topic detectors (topic models) TD_A … TD_E. Each detector contributes one bit, yielding topic vectors such as
TV1 = 0 1 0 1 0 0
V1 = 1 0 0 0 0 0
V2 = 0 0 0 1 0 0
Similarity S is then computed between topic vectors rather than raw term vectors.]
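A toy sketch of topic detectors as bit-vector features. Each hypothetical detector fires when a document mentions any of its topic's keywords (the topic names and keyword lists are invented), and similarity is the overlap between bit vectors:

```python
# Hypothetical keyword lists; each detector TD fires (outputs 1) when the
# document mentions at least one of its topic's keywords.
TOPICS = [
    ("TD_A", {"chihuahua", "breeding", "dog"}),
    ("TD_B", {"economy", "market", "trade"}),
    ("TD_C", {"opera", "verdi", "aria"}),
]

def topic_vector(text):
    """f( ): map a document to a bit vector of topic-detector outputs."""
    words = set(text.lower().split())
    return [1 if words & keywords else 0 for _, keywords in TOPICS]

def sim(tv1, tv2):
    """S: overlap (dot product) between two topic bit vectors."""
    return sum(a * b for a, b in zip(tv1, tv2))

tv1 = topic_vector("chihuahua breeding club news")
q = topic_vector("dog breeding tips")
```

Both the document and the query trip only TD_A, so they match on the topic level even though they share few surface terms.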