Information Retrieval (1955-1992)
• Primary users
– Law clerks
– Reference librarians
– (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary search models
– Boolean keyword searches on abstract, title, keywords
• Vendors
– Mead Data Central (Lexis-Nexis)
– Dialog
– Westlaw
• Total searchable online data: O(10 terabytes)
Information Retrieval (1993+)
• Primary users
– First-time computer users
– Novices
• Primary search modes
– Still Boolean keyword searches, with limited probabilistic models
– But FULL TEXT retrieval
• Vendors
– Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???
Growth of the Web
[Figure: # of web sites (or volume of web traffic) vs. year, 1992-1998, showing exponential growth; the Mosaic and Netscape releases are marked. Volume doubling every 6 months.]
Observation
• Early IR systems basically extended library catalog systems, allowing
– keyword searches,
– limited abstract searches
in addition to author/title/subject search, including Boolean combination functionality.
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand).
In Contrast
Today, IR has a much wider role in the age of digital libraries:
• Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
• Question answering
Old View
Function of IR: map queries (… AND … OR …) to relevant documents.

New View
Satisfy the user's information need. Infer goals/information need from:
- the query itself
- past user query history
- user profiling (aol.com vs. CS dept.)
- collective analysis of other user feedback on similar queries
In addition, return information in a format useful/intelligible to the user:
• weighted orderings
• clusterings of documents by different attributes
• visualization tools
• text understanding techniques to extract the answer to a question, or at least the relevant subregion of the text

"Who is the current mayor of Columbus, Ohio?"
The user doesn't need a full AP/CNN article on city scandals, just the answer (and an available source for proof).
Boolean Systems
Function #1: Provide a fast, compact index into the database (of documents or references).

Data structure: inverted file, e.g. posting lists for "Chihuahua" and "Nanny".
Index granularity options:
- doc number
- page number in doc
- actual word offset

Boolean operations:
- Chihuahua AND Nanny → join (intersection)
- Chihuahua OR Nanny → union
- Proximity searches: Chihuahua W/3 Nanny (within 3 words)
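The inverted file and the operations above can be sketched in a few lines. This is a toy illustration (the documents and their contents are made up), indexing at word-offset granularity, the finest option listed, so that proximity (W/k) searches work too:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted file: word -> {doc_id: [word offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def boolean_and(index, w1, w2):
    """w1 AND w2: join (intersection) of the posting lists."""
    return set(index[w1]) & set(index[w2])

def boolean_or(index, w1, w2):
    """w1 OR w2: union of the posting lists."""
    return set(index[w1]) | set(index[w2])

def within(index, w1, w2, k):
    """Proximity search w1 W/k w2: w1 within k words of w2."""
    return {d for d in boolean_and(index, w1, w2)
            if any(abs(p1 - p2) <= k
                   for p1 in index[w1][d] for p2 in index[w2][d])}

docs = {1: "chihuahua nanny wanted",
        2: "nanny available weekends",
        3: "chihuahua breeding club"}
index = build_index(docs)
```

With this index, `boolean_and(index, "chihuahua", "nanny")` returns only document 1, while the OR query returns all three.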
Vector IR Model
[Diagram: documents d1 and d2 are mapped by a function f( ) to vectors V1 and V2; the query Q is mapped the same way to VQ.]
Find an optimal f( ) such that
Sim(Vi, VQ) ≈ Sim'(Di, Q) and Sim(V1, V2) ≈ Sim'(d1, d2).
A common choice for Sim is cosine distance.
Vector Models
[Diagram: documents D1 and D2 are mapped to vectors V1 and V2 (a bit vector capturing the essence/meaning of D1), and the query to Q1. Retrieval: find the document maximizing Sim(Vi, Q1), e.g. Sim(V1, Q1).]
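A minimal sketch of the vector model with cosine similarity, taking raw term counts as the mapping f( ) (an assumption; the slides leave f( ) open). The documents and query are invented for illustration:

```python
import math
from collections import Counter

def vectorize(text):
    """f( ): map a document to a raw term-count vector."""
    return Counter(text.lower().split())

def cosine_sim(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v2[t] for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = ["japan economy news",
        "chihuahua breeding club",
        "the economy of japan and china"]
vecs = [vectorize(d) for d in docs]
q = vectorize("japan economy")

# Retrieval: the document whose vector is most similar to the query vector
best = max(range(len(docs)), key=lambda i: cosine_sim(vecs[i], q))
```

Note that the first and third documents both contain the query terms, but cosine normalization favors the shorter, more focused first document.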
Dimensionality Reduction (SVD/LSI)
[Diagram: document d1 → f( ) → V1 (initial term-vector representation) → V̂1 (a more compact, reduced-dimensionality model of d1).]
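One concrete way to perform the SVD/LSI reduction, sketched with NumPy on a made-up term-document matrix (the choice of k = 2 and the matrix entries are arbitrary illustrations):

```python
import numpy as np

def lsi_reduce(term_doc, k):
    """SVD-based reduction: each column (document) becomes a k-dim vector."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the top-k singular directions: V-hat = S_k @ Vt_k
    return np.diag(s[:k]) @ Vt[:k, :]

# Made-up 4-term x 3-document matrix (rows: terms, columns: documents)
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

V_hat = lsi_reduce(A, k=2)   # shape (2, 3): one reduced vector per document
```

Queries are folded into the same reduced space, so similarity comparisons operate on short dense vectors instead of the full term vocabulary.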
Clustering Words
Index offset options for a word w: K = hash(w), hash(cluster(w)), or hash(cluster(stem(w))).

Example cluster: {Japan, Nippon, Nihon, Japanese, …} → Japan
Example stems: books → book; computer → comput; computation → comput

[Diagram: in the raw term vector V1 for D1, the individual counts for "The" (192) and the Japan-cluster words (Japanese 3, Japan 1, …) appear separately; in the condensed vector the cluster collapses to a single entry "Japan* 5".]
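A toy sketch of condensing a raw term vector via clustering and stemming. The cluster table and suffix-stripping rules below are simplified assumptions chosen to reproduce the slide's examples; they are not a real stemmer:

```python
# Hypothetical stand-ins for cluster(w) and stem(w)
CLUSTERS = {"japan": "japan", "nippon": "japan",
            "nihon": "japan", "japanese": "japan"}

def stem(word):
    """Crude suffix stripping: books -> book, computation -> comput."""
    for suffix in ("ation", "ese", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def condense(raw_counts):
    """Collapse a raw term vector: map each word to its cluster if
    known, otherwise fall back to its stem, and sum the counts."""
    condensed = {}
    for word, count in raw_counts.items():
        w = word.lower()
        key = CLUSTERS.get(w, stem(w))
        condensed[key] = condensed.get(key, 0) + count
    return condensed

# Raw term vector entries from the slide's Japan example
raw = {"Japan": 3, "Nippon": 1, "Japanese": 1}
```

Here `condense(raw)` merges all three surface forms into one "japan" entry with count 5, matching the condensed vector on the slide.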
Collocation (Phrasal Terms)
[Diagram: over the dimensions (Soap, Opera, "Soap opera"):
d1 "The soap opera" → 0 0 1
d2 "The soap residue" / "an opera by Verdi" → 1 1 0
Treating the collocation "soap opera" as a single phrasal term distinguishes d1, which is about a soap opera, from d2, which merely mentions soap and an opera.]
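Phrasal-term handling can be sketched as a preprocessing pass that rewrites known collocations into single tokens before vectorization (the phrase list here is a stand-in for a real collocation dictionary):

```python
def phrasal_terms(text, phrases=frozenset({("soap", "opera")})):
    """Rewrite known two-word collocations as single phrasal tokens."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            out.append(words[i] + "_" + words[i + 1])  # one phrasal term
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

After this pass, "The soap opera" and "The soap residue" no longer share the token that matters for a query about soap operas.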
A Vector, Abstractly, Is a Compressed Document (Meaning-Preserving)
document d1 → f(d1) = m1
document d2 → f(d2) = m2
• Compression: m1 = m2 iff d1 = d2; f( ) must be invertible.
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing): a "meaning" or "context" vector representation.

What is the optimal method for meaning-preserving compression?
Issues
• Size of representation (ideally size(Vi) << size(Di))
• Cost of computing the vectors
– one-time cost at model creation
• Cost of the similarity function
– must be computed for each query
– crucial to speed that this be minimized
• Header processing: retain/model cross references (e.g. fold the referenced documents' vectors into Vi, as in Vi = V1 + ref2(V2) + ref3(V3))
• Function words: 1. remove (most) function words, or 2. downweight them by frequency, or 3. use text analysis to decide which function words carry meaning (e.g. NOT).
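Downweighting by frequency (option 2 above) is commonly realized as tf-idf, sketched here on invented documents. A function word like "the", which appears in every document, gets a weight of exactly zero:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by tf * log(N / df): terms that occur in many
    documents are downweighted, terms in few documents are boosted."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency of each term
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = ["the chihuahua barked", "the nanny arrived", "the opera began"]
vecs = tf_idf(docs)
# "the" occurs in all 3 documents, so its idf is log(3/3) = 0
```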
Supervised Learning/Training
[Diagram: an input data stream feeds one trained recognizer per class:
A - Project #1
B - Chihuahua Breeding Club
C - Personal
J - Junk mail
Collective discrimination among the recognizers produces labelled (routed) output, e.g. B A C J, in real time (ongoing); the labelled output also serves as training data.]

Other related problems: mail/news routing and filtering.
[Diagram: a data stream is routed into prioritized inboxes: Project #1 at work, Project #2 at work, Chihuahua breeding, Scuba club, Personal, Junk mail.]
These systems typically model long-term information needs. (People put effort into training and user feedback that they aren't willing to invest for single-query-based IR.)
Features for classification (regions weighted differently):
• Subject line
• Source/sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content
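A minimal sketch of a recognizer trained on such features, here a naive Bayes classifier with add-one smoothing. The feature extraction, the doubling of subject-line words to weight that region more heavily, and the sample messages are all illustrative assumptions, not the course's actual design:

```python
import math
from collections import Counter, defaultdict

def extract_features(msg):
    """Features from an email dict: sender, subject line, and body.
    Subject words are repeated to weight that region more heavily."""
    feats = ["sender:" + msg["sender"]]
    feats += ["subj:" + w for w in msg["subject"].lower().split()] * 2
    feats += msg["body"].lower().split()
    return feats

def train(labeled):
    """Count features per class label."""
    counts = defaultdict(Counter)
    for msg, label in labeled:
        counts[label].update(extract_features(msg))
    return counts

def classify(counts, msg):
    """Naive Bayes with add-one smoothing: return the best-scoring label."""
    feats = extract_features(msg)
    def score(label):
        c = counts[label]
        total = sum(c.values())
        return sum(math.log((c[f] + 1) / (total + len(c) + 1))
                   for f in feats)
    return max(counts, key=score)

labeled = [
    ({"sender": "club@chi.org", "subject": "chihuahua news",
      "body": "breeding tips"}, "B"),
    ({"sender": "spam@x.com", "subject": "win money",
      "body": "free money now"}, "J"),
]
counts = train(labeled)
msg = {"sender": "club@chi.org", "subject": "chihuahua meetup",
       "body": "club event"}
```

A new message from the club address with "chihuahua" in the subject routes to class B; a money-themed message routes to J.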
Probabilistic IR Models: Intermediate Topic Models/Detectors
[Diagram: documents d1, d2 and the query Q are mapped by f( ) through a bank of topic detectors (topic models) TD_A … TD_E. Each detector contributes one bit, yielding topic vectors such as
TV1 = 0 1 0 1 0 0
V1 = 1 0 0 0 0 0
V2 = 0 0 0 1 0 0
Similarity S is then computed between topic vectors rather than raw term vectors.]
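A toy sketch of topic detectors as bit-vector features. Each hypothetical detector fires when a document mentions any of its topic's keywords (the topic names and keyword lists are invented), and similarity is the overlap between bit vectors:

```python
# Hypothetical keyword lists; each detector TD fires (outputs 1) when the
# document mentions at least one of its topic's keywords.
TOPICS = [
    ("TD_A", {"chihuahua", "breeding", "dog"}),
    ("TD_B", {"economy", "market", "trade"}),
    ("TD_C", {"opera", "verdi", "aria"}),
]

def topic_vector(text):
    """f( ): map a document to a bit vector of topic-detector outputs."""
    words = set(text.lower().split())
    return [1 if words & keywords else 0 for _, keywords in TOPICS]

def sim(tv1, tv2):
    """S: overlap (dot product) between two topic bit vectors."""
    return sum(a * b for a, b in zip(tv1, tv2))

tv1 = topic_vector("chihuahua breeding club news")
q = topic_vector("dog breeding tips")
```

Both the document and the query trip only TD_A, so they match on the topic level even though they share few surface terms.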