modern information retrieval computer engineering department fall 2005

19
Modern Information Modern Information Retrieval Retrieval Computer engineering Computer engineering department department Fall 2005 Fall 2005

Upload: phyllis-york

Post on 16-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modern Information Retrieval Computer engineering department Fall 2005

Modern Information Modern Information Retrieval Retrieval

Computer engineering Computer engineering departmentdepartment

Fall 2005 Fall 2005

SubjectsSubjects

1 Introduction1 Introduction2 Models2 Models3 Retrieval evaluation3 Retrieval evaluation4 Query languages4 Query languages5 Query reformulation5 Query reformulation6 Text properties6 Text properties7 Text languages7 Text languages8 Text processing8 Text processing9 Information retrieval from the Web9 Information retrieval from the Web

SourcesSources

1Modern Information Retrieval1Modern Information RetrievalR Baeza-Yates and R Baeza-Yates and B Ribeiro-Neto Addison-Wesley 1999B Ribeiro-Neto Addison-Wesley 1999

2Information Retrieval Systems Theory and 2Information Retrieval Systems Theory and ImplementationImplementationG Kowalski Kluwer 1997G Kowalski Kluwer 1997

3Automatic Text Processing3Automatic Text ProcessingG Salton Addison-G Salton Addison-Wesley 1989Wesley 1989

4Database System Concepts 4thEdition4Database System Concepts 4thEditionA A Silberschatz H Korth and S Sudarshan Silberschatz H Korth and S Sudarshan McGraw-Hill 2001McGraw-Hill 2001

5Various Web sites5Various Web sites6Original material6Original material

MotivationMotivation

bull IR representation storage organization of IR representation storage organization of and access to information itemsand access to information items

bull Focus is on the Focus is on the user information needuser information needbull Information itemInformation item Usually text but possibly Usually text but possibly

also image audio video etcalso image audio video etcbull Presently most retrieval of non-text items Presently most retrieval of non-text items

is based on searching their textual is based on searching their textual descriptionsdescriptions

bull Text items are often referred to as Text items are often referred to as documents documents and may be of different scope and may be of different scope (book article paragraph etc)(book article paragraph etc)

Motivation (cont)Motivation (cont)

bull User information needUser information needndash Find all docs containing information on college tennis Find all docs containing information on college tennis

teams which (1) are maintained by a USA university teams which (1) are maintained by a USA university and (2) participate in the NCAA tournamentand (2) participate in the NCAA tournament

bull Emphasis is on the retrieval of information Emphasis is on the retrieval of information (not data)(not data)

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 2: Modern Information Retrieval Computer engineering department Fall 2005

SubjectsSubjects

1 Introduction1 Introduction2 Models2 Models3 Retrieval evaluation3 Retrieval evaluation4 Query languages4 Query languages5 Query reformulation5 Query reformulation6 Text properties6 Text properties7 Text languages7 Text languages8 Text processing8 Text processing9 Information retrieval from the Web9 Information retrieval from the Web

SourcesSources

1Modern Information Retrieval1Modern Information RetrievalR Baeza-Yates and R Baeza-Yates and B Ribeiro-Neto Addison-Wesley 1999B Ribeiro-Neto Addison-Wesley 1999

2Information Retrieval Systems Theory and 2Information Retrieval Systems Theory and ImplementationImplementationG Kowalski Kluwer 1997G Kowalski Kluwer 1997

3Automatic Text Processing3Automatic Text ProcessingG Salton Addison-G Salton Addison-Wesley 1989Wesley 1989

4Database System Concepts 4thEdition4Database System Concepts 4thEditionA A Silberschatz H Korth and S Sudarshan Silberschatz H Korth and S Sudarshan McGraw-Hill 2001McGraw-Hill 2001

5Various Web sites5Various Web sites6Original material6Original material

MotivationMotivation

bull IR representation storage organization of IR representation storage organization of and access to information itemsand access to information items

bull Focus is on the Focus is on the user information needuser information needbull Information itemInformation item Usually text but possibly Usually text but possibly

also image audio video etcalso image audio video etcbull Presently most retrieval of non-text items Presently most retrieval of non-text items

is based on searching their textual is based on searching their textual descriptionsdescriptions

bull Text items are often referred to as Text items are often referred to as documents documents and may be of different scope and may be of different scope (book article paragraph etc)(book article paragraph etc)

Motivation (cont)Motivation (cont)

bull User information needUser information needndash Find all docs containing information on college tennis Find all docs containing information on college tennis

teams which (1) are maintained by a USA university teams which (1) are maintained by a USA university and (2) participate in the NCAA tournamentand (2) participate in the NCAA tournament

bull Emphasis is on the retrieval of information Emphasis is on the retrieval of information (not data)(not data)

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 3: Modern Information Retrieval Computer engineering department Fall 2005

SourcesSources

1Modern Information Retrieval1Modern Information RetrievalR Baeza-Yates and R Baeza-Yates and B Ribeiro-Neto Addison-Wesley 1999B Ribeiro-Neto Addison-Wesley 1999

2Information Retrieval Systems Theory and 2Information Retrieval Systems Theory and ImplementationImplementationG Kowalski Kluwer 1997G Kowalski Kluwer 1997

3Automatic Text Processing3Automatic Text ProcessingG Salton Addison-G Salton Addison-Wesley 1989Wesley 1989

4Database System Concepts 4thEdition4Database System Concepts 4thEditionA A Silberschatz H Korth and S Sudarshan Silberschatz H Korth and S Sudarshan McGraw-Hill 2001McGraw-Hill 2001

5Various Web sites5Various Web sites6Original material6Original material

MotivationMotivation

bull IR representation storage organization of IR representation storage organization of and access to information itemsand access to information items

bull Focus is on the Focus is on the user information needuser information needbull Information itemInformation item Usually text but possibly Usually text but possibly

also image audio video etcalso image audio video etcbull Presently most retrieval of non-text items Presently most retrieval of non-text items

is based on searching their textual is based on searching their textual descriptionsdescriptions

bull Text items are often referred to as Text items are often referred to as documents documents and may be of different scope and may be of different scope (book article paragraph etc)(book article paragraph etc)

Motivation (cont)Motivation (cont)

bull User information needUser information needndash Find all docs containing information on college tennis Find all docs containing information on college tennis

teams which (1) are maintained by a USA university teams which (1) are maintained by a USA university and (2) participate in the NCAA tournamentand (2) participate in the NCAA tournament

bull Emphasis is on the retrieval of information Emphasis is on the retrieval of information (not data)(not data)

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 4: Modern Information Retrieval Computer engineering department Fall 2005

MotivationMotivation

bull IR representation storage organization of IR representation storage organization of and access to information itemsand access to information items

bull Focus is on the Focus is on the user information needuser information needbull Information itemInformation item Usually text but possibly Usually text but possibly

also image audio video etcalso image audio video etcbull Presently most retrieval of non-text items Presently most retrieval of non-text items

is based on searching their textual is based on searching their textual descriptionsdescriptions

bull Text items are often referred to as Text items are often referred to as documents documents and may be of different scope and may be of different scope (book article paragraph etc)(book article paragraph etc)

Motivation (cont)Motivation (cont)

bull User information needUser information needndash Find all docs containing information on college tennis Find all docs containing information on college tennis

teams which (1) are maintained by a USA university teams which (1) are maintained by a USA university and (2) participate in the NCAA tournamentand (2) participate in the NCAA tournament

bull Emphasis is on the retrieval of information Emphasis is on the retrieval of information (not data)(not data)

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 5: Modern Information Retrieval Computer engineering department Fall 2005

Motivation (cont)Motivation (cont)

bull User information needUser information needndash Find all docs containing information on college tennis Find all docs containing information on college tennis

teams which (1) are maintained by a USA university teams which (1) are maintained by a USA university and (2) participate in the NCAA tournamentand (2) participate in the NCAA tournament

bull Emphasis is on the retrieval of information Emphasis is on the retrieval of information (not data)(not data)

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 6: Modern Information Retrieval Computer engineering department Fall 2005

Motivation (cont) Data retrieval

which docs contain a set of keywords Well defined semantics a single erroneous object implies

failure Information retrieval

information about a subject or topic semantics is frequently loose small errors are tolerated

IR system interpret contents of information items generate a ranking which reflects

relevance notion of relevance is most important

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 7: Modern Information Retrieval Computer engineering department Fall 2005

Motivation (cont) IR at the center of the stage

IR in the last 20 years classification and categorization systems and languages user interfaces and visualization

Still area was seen as of narrow interest Advent of the Web changed this perception

once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though IR seen as key to

finding the solutions

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 8: Modern Information Retrieval Computer engineering department Fall 2005

Objecttives

1048633Overall objective Minimize search overhead

1048633Measurement of success Precision and recall

1048633Facilitate the overall objective Good search tools Helpful presentation of results

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 9: Modern Information Retrieval Computer engineering department Fall 2005

Minimize search overheadMinimize overheadof a user who is locating needed information Overhead Time spent in all steps leading to the reading of

items containing the needed information (query generation query execution scanning results reading non-relevant items etc)

Needed information Either Sufficient information in the system to complete a task All information in the system relevant to the user needs Example ndashshopping

Looking for an item to purchase Looking for an item to purchase at minimal cost

Example ndashresearching Looking for a bibliographic citation that explains a particular

term Building a comprehensive bibliography on a particular

subject

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 10: Modern Information Retrieval Computer engineering department Fall 2005

Measurement of success

Two dual measures 1048633Precision Proportion of items retrieved that are

relevant Precision = relevant retrieved total retrieved = Answer Relevant Answer

1048633Recall Proportion of relevant items that are retrieved

Recall = relevant retrieved relevant exist = Answer Relevant Relevant

1048633Most popular measures but others exist

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 11: Modern Information Retrieval Computer engineering department Fall 2005

Measurement of success (cont)

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 12: Modern Information Retrieval Computer engineering department Fall 2005

Support user search

Support user search providing tools to overcome obstacles such as

Ambiguities inherent in languages Homographs Words with identical spelling but with multiple

meanings Example ChinonmdashJapanese electronics French chateau

Limits to users ability to express needs Lack of system experience or aptitude

Lack of expertise in the area being searched Initially only vague concept of information sought Differences between users vocabulary and authors

vocabulary different words with similar meanings

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 13: Modern Information Retrieval Computer engineering department Fall 2005

Presentation of results

Present search results in format that helps user determine relevant items

Arbitrary (physical) order Relevance order Clustered (eg conceptual similarity) Graphical (visual) representation

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 14: Modern Information Retrieval Computer engineering department Fall 2005

Basic Concepts The User Task

Retrieval information or data purposeful

Browsing glancing around F1 cars Le Mans France tourism

Retrieval

Browsing

Database

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 15: Modern Information Retrieval Computer engineering department Fall 2005

Querying(retrieval) vs Browsing

Two complementary forms of information or data retrieval

Querying Information need (retrieval goal) is focused

and crystallized Contents of repository are well-known Often user is sophisticated

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 16: Modern Information Retrieval Computer engineering department Fall 2005

Querying(retrieval) vs Browsing (cont)

Browsing bullInformation need (retrieval goal) is vague

and imprecise bullContents of repository are not well-known bullOften user is naive

Querying and browsing are often interleaved (in the same session) bullExample present a query to a search

engine browse in the results restate the original query etc

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 17: Modern Information Retrieval Computer engineering department Fall 2005

Pulling vs Pushing information Querying and browsing are both initiated by users

(information is ldquopulledrdquo from the sources) Alternatively information may be ldquopushedrdquo to

users Dynamically compare newly received items against

standing statements of interests of users (profiles) and deliver matching items to user mail files

Asynchronous (background) process Profile defines all areas of interest (whereas an

individual query focuses on specific question) Each item compared against many profiles

(whereas each query is compared against many items)

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 18: Modern Information Retrieval Computer engineering department Fall 2005

Basic Concepts Logical view of the documents

Document representation viewed as a continuum logical view of docs might shift

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexingDocs

structure Full text

Index terms

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
Page 19: Modern Information Retrieval Computer engineering department Fall 2005

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

The Retrieval Process

  • Subjects
  • Sources
  • Motivation
  • Motivation (cont)
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19