modern information retrieval chap. 02: modeling (structured text models)

Modern Information Retrieval

Chap. 02: Modeling

(Structured Text Models)

Introduction

• Keyword-based query answering considers that the documents are flat, i.e., a word in the title has the same weight as a word in the body of the document

• But the document structure is one additional piece of information which can be taken advantage of

• For instance, words appearing in the title or in sub-titles within the document could receive higher importance

Introduction (cont.)

• Consider the following information need:
– Retrieve all documents which contain a page in which the string “atomic holocaust” appears in italic in the text surrounding a Figure whose label contains the word earth

• The corresponding query could be:
– same-page( near( italic(“atomic holocaust”), Figure( label( “earth” ))))

Introduction (cont.)

• Advanced interfaces that facilitate the specification of the structure are also highly desirable

• Models which allow combining information on text content with information on document structure are called structured text models

• Structured text models include no ranking (open research problem)

Basic Definitions

• Match point: the position in the text of a sequence of words that matches the query
– Query: “atomic holocaust in Hiroshima”
– Doc dj: contains 3 lines with this string
– Then, doc dj contains 3 match points

• Region: a contiguous portion of the text

• Node: a structural component of the text such as a chapter, a section, etc.
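The match-point definition above can be illustrated with a minimal Python sketch. It assumes that positions are word offsets and that matching is case-insensitive; the slides fix neither choice, and the function name `match_points` and the sample document are hypothetical.

```python
def match_points(text: str, query: str) -> list[int]:
    """Return the word offsets at which the query's word sequence starts."""
    words = text.lower().split()
    q = query.lower().split()
    if not q:
        return []
    return [
        i for i in range(len(words) - len(q) + 1)
        if words[i:i + len(q)] == q
    ]


# Hypothetical document dj: three lines, each containing the query string.
doc = (
    "the atomic holocaust in Hiroshima changed history\n"
    "survivors of the atomic holocaust in Hiroshima spoke\n"
    "books on the atomic holocaust in Hiroshima abound"
)
points = match_points(doc, "atomic holocaust in Hiroshima")
print(points)  # [1, 10, 18]: three match points, one per line
```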

Non-Overlapping Lists

• Due to Burkowski, 1992.

• Idea: divide the text into non-overlapping regions, which are collected in a list

• Multiple ways to divide the text into non-overlapping parts yield multiple lists:
– a list for chapters
– a list for sections
– a list for subsections

• Text regions from distinct lists might overlap


[Figure: four lists of non-overlapping regions over the same text: L0 (Chapter), L1 (Sections), L2 (SubSections), L3 (SubSubSections)]


• Implementation:
– single inverted file that combines keywords and text regions
– to each entry in this inverted file is associated a list of text regions
– lists of text regions can be merged with lists of keywords


• Regions are non-overlapping, which limits the queries that can be asked

• Types of queries:
– select a region that contains a given word
– select a region A that does not contain a region B (regions A and B belong to distinct lists)
– select a region not contained within any other region
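The first two query types above can be sketched in Python as follows. This is only an illustration under stated assumptions: regions are inclusive `(start, end)` position pairs kept sorted within each list, the inverted-file entry maps a word to sorted positions, and all names and sample data (`region_lists`, `inverted`, `containing_word`, `not_containing`) are hypothetical rather than taken from Burkowski's implementation.

```python
from bisect import bisect_left

# Hypothetical lists: within each list the regions are sorted and do not overlap.
region_lists = {
    "chapter": [(0, 99), (100, 199)],
    "section": [(0, 49), (50, 99), (100, 149), (150, 199)],
    "figure": [(20, 30)],
}
# Hypothetical inverted-file entry: word -> sorted occurrence positions.
inverted = {"holocaust": [10, 120, 130]}


def containing_word(regions, positions):
    """Select the regions that contain at least one occurrence position."""
    selected = []
    for start, end in regions:
        i = bisect_left(positions, start)  # first occurrence at or after start
        if i < len(positions) and positions[i] <= end:
            selected.append((start, end))
    return selected


def not_containing(regions_a, regions_b):
    """Select the regions of list A that contain no region of list B."""
    return [
        (s, e) for s, e in regions_a
        if not any(s <= bs and be <= e for bs, be in regions_b)
    ]


print(containing_word(region_lists["section"], inverted["holocaust"]))
print(not_containing(region_lists["section"], region_lists["figure"]))
```

Because the regions inside one list never overlap, each list can be scanned once in order, which is what keeps this model cheap to implement.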


• The non-overlapping lists model is simple and allows efficient implementation

• But the types of queries that can be asked are limited

• Also, the model does not include any provision for ranking the documents by degree of similarity to the query

• What does structural similarity mean?

Hybrid Model

• The model sees the database as composed of a set of documents (or files, if no structure is defined), which may have fields.

• Those fields need not fully cover the text, and can nest and overlap.

• There are a number of operations to obtain match points: prefix search, proximity, etc.

• There are operations for union, intersection, difference and complement of both documents and match points.

Hybrid Model

• There are also operations to restrict matches to only some fields, and to retrieve fields containing some match point.

• Since it is not possible to determine whether a field is included in another (except under certain assumptions on the hierarchy), we say that the model is “flat”,

• and since it is not possible to make certain compositions of expressions involving fields, we say that it is not “compositional”.
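The two field operations of the hybrid model can be sketched as below. The representation is an assumption made for illustration: fields are named inclusive position ranges that may nest, overlap, and leave text uncovered, and match points are plain positions. `Field`, `matches_in`, and `fields_with_match` are hypothetical names.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int
    end: int


# Hypothetical document: "quote" nests inside "body"; positions above 80 are uncovered.
fields = [Field("title", 0, 5), Field("body", 6, 80), Field("quote", 40, 60)]
matches = [3, 45, 90]


def matches_in(matches, fields, name):
    """Restrict match points to those falling inside a field with the given name."""
    return [
        m for m in matches
        if any(f.name == name and f.start <= m <= f.end for f in fields)
    ]


def fields_with_match(fields, matches):
    """Retrieve the fields that contain at least one match point."""
    return [f for f in fields if any(f.start <= m <= f.end for m in matches)]


print(matches_in(matches, fields, "body"))                # [45]
print([f.name for f in fields_with_match(fields, matches)])
```

Note that nothing here records that "quote" lies inside "body"; each field is tested independently, which mirrors the sense in which the model is called flat.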

PAT Expressions

• The only index is on match points; there is no indexing on structure.

• Instead, the language allows dynamic definition of structures, based on match point expressions for the beginning and end of regions. It also allows the use of externally computed regions.

• Structures can have substructures of another type; this fact is not explicit, but derived from the inclusion relationship between regions.

PAT Expressions

• Recursive structures (e.g., sections having other sections inside) are not allowed; each structure owns a set of non-overlapping areas of the text.

• Despite these drawbacks, the model is a good example of structuring and querying documents by mixing content and structure.

• Most importantly, since all operations are based on the PAT array, they are extremely fast. Operations on areas are also fast, since they are non-overlapping and non-recursive.
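The dynamic definition of a structure from begin and end match points can be sketched as follows. The pairing rule is an assumption chosen for illustration (each begin is paired with the first end after it, and begins falling inside an already-built region are skipped so that the areas stay non-overlapping and non-recursive); the actual PAT semantics may differ, and `define_regions` is a hypothetical name.

```python
def define_regions(begins, ends):
    """Build non-overlapping (begin, end) regions from two sets of match points."""
    regions = []
    ends_iter = iter(sorted(ends))
    end = next(ends_iter, None)
    last_end = -1
    for b in sorted(begins):
        if b <= last_end:
            continue  # begin lies inside the previous region: no nesting allowed
        while end is not None and end <= b:
            end = next(ends_iter, None)  # discard end points before this begin
        if end is None:
            break  # no closing match point left
        regions.append((b, end))
        last_end = end
    return regions


# Hypothetical match points for a "begin section" and an "end section" expression.
print(define_regions([0, 10, 12, 30], [8, 20, 40]))  # [(0, 8), (10, 20), (30, 40)]
```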

Overlapped Lists

• The original idea was to have lists of disjoint segments, originated by textual searches or by “regions” like chapters.

• The model enhances this algebra with overlapping capabilities, some new operators and a framework for an implementation.

Overlapped Lists

• With these enhancements, the model becomes a reworking of PAT expressions that elegantly solves their semantic problems.

• The new operators allow performing set union and combining areas.

• Combination means selecting the minimal text areas including two segments, for any two segments taken from two sets.
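One reading of the combination operator described above is sketched here in Python: form the covering span of every pair of segments, then keep only the spans that contain no other candidate span. The representation of segments as inclusive `(start, end)` pairs and the name `combine` are assumptions, and the quadratic pairwise loop is for clarity only, not the efficient implementation the model proposes.

```python
def combine(set_a, set_b):
    """Minimal text areas that include one segment from each of the two sets."""
    # Covering span of each pair of segments.
    spans = {
        (min(a[0], b[0]), max(a[1], b[1]))
        for a in set_a
        for b in set_b
    }
    # Keep a span only if no other candidate span lies entirely inside it.
    return sorted(
        s for s in spans
        if not any(t != s and s[0] <= t[0] and t[1] <= s[1] for t in spans)
    )


# Hypothetical segments, e.g. the matches of two different words.
print(combine([(0, 2), (10, 12)], [(4, 5), (20, 22)]))
# [(0, 5), (4, 12), (10, 22)]
```

In this example the results `(0, 5)` and `(4, 12)` overlap, which is exactly what a model restricted to disjoint segments could not represent.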

Lists of References

• Although the structure of documents is hierarchical (with only one strict hierarchy), answers to queries are flat (only the top-level elements qualify), and all elements must be of the same type (e.g., only sections, or only paragraphs).

• Answers to queries are seen as lists of “references”.

• A reference is a pointer to a region of the database. This elegantly integrates answers to queries and hypertext links, since all are lists of references.

Lists of References

• The model also has navigational features to traverse those lists.

• This model is very powerful and, because of this, has efficiency problems. To make the model suitable for our comparisons, we consider only the portion related to querying structures. Even this portion is quite powerful, and allows queries to be solved efficiently by first locating the text matches and then filtering the candidates with the structural restrictions.

Proximal Nodes

• Due to Navarro and Baeza-Yates, 1997

• Idea: define a strict hierarchical index over the text. This enriches the previous model that used flat lists.

• Multiple index hierarchies might be defined

• Two distinct index hierarchies might refer to text regions that overlap

Definitions

• Each indexing structure is a strict hierarchy composed of
– chapters
– sections
– subsections
– paragraphs
– lines

• Each of these components is called a node

• To each node is associated a text region


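[Figure: hierarchical index with Chapter, Sections, SubSections, and SubSubSections nodes, alongside the inverted list entry for “holocaust” with positions 10, 256, and 48,324]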


• Key points:
– In the hierarchical index, one node might be contained within another node
– But two nodes of the same hierarchy cannot overlap
– The inverted list for keywords complements the hierarchical index
– The implementation here is more complex than that for non-overlapping lists

Proximal Nodes

• Queries are now regular expressions:
– search for strings
– references to structural components
– combination of these

• Model is a compromise between expressiveness and efficiency

• Queries are simple but can be processed efficiently

• Further, model is more expressive than non-overlapping lists

Proximal Nodes

• Query: find the sections, the subsections, and the subsubsections that contain the word “holocaust”
– [(*section) with (“holocaust”)]

• Simple query processing:
– traverse the inverted list for “holocaust” and determine all match points
– use the match points to search in the hierarchical index for the structural components


• Query: [(*section) with (“holocaust”)]

• Sophisticated query processing:
– get the first entry in the inverted list for “holocaust”
– use this match point to search in the hierarchical index for the structural components
– Innermost matching component: the smallest one
– Check if the innermost matching component includes the second entry in the inverted list for “holocaust”
– If it does, check the third entry and so on
– This allows efficient matching of the nearby (or proximal) nodes
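The sophisticated processing of [(*section) with (“holocaust”)] can be approximated with the simplified Python sketch below. It is not the authors' implementation: the `Node` class, the sample hierarchy, and the function names are hypothetical. Positions 10, 256, and 48324 come from the slide's inverted list figure, while 12 is an added assumption to show a match that falls inside the same innermost node. For such a match the sketch searches only below that node instead of restarting from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    start: int
    end: int
    children: list["Node"] = field(default_factory=list)


def path_to(node, pos):
    """Nodes from `node` down to the innermost node whose region covers pos."""
    if not (node.start <= pos <= node.end):
        return []
    for child in node.children:
        sub = path_to(child, pos)
        if sub:
            return [node] + sub
    return [node]


def sections_with(root, positions):
    """All *section nodes that contain at least one of the sorted positions."""
    answer, seen = [], set()

    def collect(nodes):
        for node in nodes:
            if node.kind.endswith("section") and id(node) not in seen:
                seen.add(id(node))
                answer.append(node)

    i = 0
    while i < len(positions):
        path = path_to(root, positions[i])
        i += 1
        if not path:
            continue
        collect(path)
        innermost = path[-1]
        # Later entries inside the same innermost node: inspect only nearby nodes.
        while i < len(positions) and innermost.start <= positions[i] <= innermost.end:
            collect(path_to(innermost, positions[i])[1:])
            i += 1
    return answer


# Hypothetical hierarchy; nodes of one hierarchy nest but never overlap.
root = Node("chapter", 0, 50000, [
    Node("section", 0, 300, [
        Node("subsection", 0, 100, [Node("subsubsection", 5, 20)]),
        Node("subsection", 200, 300),
    ]),
    Node("section", 40000, 50000),
])
hits = sections_with(root, [10, 12, 256, 48324])
print([(n.kind, n.start, n.end) for n in hits])
```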

Proximal Nodes

• Model allows formulating queries that are more sophisticated than those allowed by non-overlapping lists

• To speed up query processing, nearby nodes are inspected

• Types of queries that can be asked are somewhat limited (all nodes in the answer must come from the same index hierarchy!)

• Model is a compromise between efficiency and expressiveness

Tree Matching

• A model relying on a single primitive, tree inclusion, is proposed.

• The idea of tree inclusion is, seeing both the structure of the database and the query (a pattern on structure) as trees, to find an embedding of the pattern into the database which respects the hierarchical relationships between nodes of the pattern.

Tree Matching

• Ordered inclusion forces the embedding to respect the left-to-right relations among siblings in the pattern, while unordered inclusion does not.

• Tree inclusion is a way to query on structural properties in which the user does not need to be aware of all the details of the structure, but only of the parts he/she is interested in. This is known as “data independence”.
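Ordered tree inclusion can be illustrated with the small recursive sketch below. It treats inclusion as "the pattern can be obtained from the database tree by deleting nodes, with a deleted node's children taking its place", compares labels only, and runs in exponential time in the worst case; it is meant to show the idea, not an efficient algorithm. The `Tree` class, function names, and sample document are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    label: str
    children: tuple["Tree", ...] = ()


def forest_included(pattern, target):
    """True if the ordered forest `pattern` results from deleting nodes of `target`."""
    if not pattern:
        return True
    if not target:
        return False
    p, t = pattern[-1], target[-1]
    # Option 1: map the rightmost pattern root onto the rightmost target root.
    if (p.label == t.label
            and forest_included(p.children, t.children)
            and forest_included(pattern[:-1], target[:-1])):
        return True
    # Option 2: delete the rightmost target root; its children take its place.
    return forest_included(pattern, target[:-1] + t.children)


def included(pattern, target):
    return forest_included((pattern,), (target,))


# Hypothetical document structure.
doc = Tree("book", (
    Tree("chapter", (
        Tree("title"),
        Tree("section", (Tree("figure"),)),
    )),
))

# The pattern ignores the chapter and section levels (data independence).
print(included(Tree("book", (Tree("title"), Tree("figure"))), doc))   # True
# Same labels in the opposite sibling order: rejected by ordered inclusion.
print(included(Tree("book", (Tree("figure"), Tree("title"))), doc))   # False
```

An unordered variant would accept the second pattern as well, since it does not constrain the left-to-right order of siblings.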

Parsed Strings

• The language used to express a database schema is a context-free grammar, that is, the database is structured by giving a grammar to parse its text. The fundamental data structure is the p-string, or parsed string, which is composed of a derivation tree plus the underlying text.

Parsed Strings

• The parsing process implicitly comprises the work of pattern matching; there are no further operations to express it.

• There are a number of powerful operations that can be performed to manipulate parsed strings: they can be reparsed by another grammar, some non-terminals can be hidden, etc.

• The problem is efficiency. Being such a dynamic approach, it is hard to implement efficiently.

Expressiveness Analysis

A Taxonomy of Models

• An analysis in three parts:
– structuring power
– query language
– efficiency

Structuring Power

Query Language

Query Time Complexity

• From the description of the implementation of the different models, we classify them according to querying times. We measure the efficiency of a query as a function of n, the total size of intermediate results, unless otherwise specified.

Query Time Complexity

Conclusion

• No model is the best for all applications, especially because the more expressive the model, the less efficient it can be.

• Each application has its own set of requirements, and should select the most efficient model supporting them.

• Another important issue is the perspective of the user. When we incorporate operators and evaluate the cost of implementing them, we are implicitly assuming that they are useful for the user of the system.

Additional Reading

• “Integrating Contents and Structure in Text Retrieval” (paper)
