Chapter 4: Query Languages


Upload: cheung

Post on 13-Jan-2016


Page 1: Chapter 4 Query Languages

Chapter 4: Query Languages

Page 2: Chapter 4 Query Languages

Introduction

- Covers the different kinds of queries posed to text retrieval systems
- Keyword-based query languages
  - include simple words and phrases as well as Boolean operators
- Pattern matching
  - complements keyword searching with data retrieval capabilities
- Structural queries
  - query on the structure of the text

Page 3: Chapter 4 Query Languages

Keyword-Based Querying

- A query is the formulation of a user information need
- Keyword-based queries are popular
  - intuitive
  - easy to express
  - allow for fast ranking
- A query can be simply a word or, in general, a more complex combination of operations involving several words

Page 4: Chapter 4 Query Languages

Single-Word Queries

- The most elementary query is a word
- A word is a sequence of letters surrounded by separators
  - some characters are not letters but do not split a word, e.g. the hyphen in ‘on-line’
- The result of a word query is
  - the set of documents containing at least one of the words in the query
  - the resulting documents are ranked using
    - term frequency (count of occurrences of the word in the document)
    - inverse document frequency (a weight that decreases as the number of documents in which the word appears grows)
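
The ranking described above can be sketched as follows. This is a minimal illustration, not the slides' own method: the tokenizer, the function names `tokenize` and `rank`, the sample documents, and the choice of `log(N / df)` as the inverse document frequency are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text; a hyphen inside a word (as in 'on-line') does not split it."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def rank(query, docs):
    """Return (doc_id, score) pairs for documents containing at least one query word."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for word in set(tokenize(query)):
        df = sum(1 for c in counts.values() if word in c)  # document frequency
        if df == 0:
            continue
        idf = math.log(n_docs / df)  # assumed idf variant: rarer words weigh more
        for doc_id, c in counts.items():
            if word in c:
                scores[doc_id] = scores.get(doc_id, 0.0) + c[word] * idf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy collection.
docs = {
    "d1": "on-line retrieval of text",
    "d2": "text text text",
    "d3": "retrieval systems enhance retrieval",
}
ranking = rank("retrieval text", docs)
```

Here `d2` comes first because it repeats a query word three times, while `d1` and `d3` each accumulate two occurrences of query words.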

Page 5: Chapter 4 Query Languages

Context Queries

- Complement single-word queries with the ability to search for words in a given context, i.e. near other words
- Words near each other signal a higher likelihood of relevance than if they appear apart
- Form phrases of words, or find words which are proximal in the text

Page 6: Chapter 4 Query Languages

Phrase

- A sequence of single-word queries
  - for instance, it is possible to search for the word ‘enhance’ and then the word ‘retrieval’
- Uninteresting words in the text are not considered at all
  - e.g. the example query above could match text such as ‘…enhance the retrieval…’
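
A phrase match that skips uninteresting words might look like the sketch below. The stopword list and the names `content_words` and `matches_phrase` are assumptions for illustration only.

```python
import re

# Assumed list of "uninteresting" words; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "of", "and"}

def content_words(text):
    """Return the lowercase words of the text with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def matches_phrase(phrase, text):
    """True if the phrase's words occur consecutively once stopwords are dropped."""
    target = content_words(phrase)
    words = content_words(text)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target for i in range(len(words) - n + 1))

found = matches_phrase("enhance retrieval", "systems that enhance the retrieval of text")
```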

Page 7: Chapter 4 Query Languages

Proximity

- A sequence of single words or phrases is given together with a maximum allowed distance between them
- For instance, the query above can be stated as
  - ‘enhance’ and ‘retrieval’ should occur within four words
  - a possible match could be ‘…enhance the power of retrieval…’
- Distance can be measured in characters or words
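
A word-based proximity check can be sketched as follows. It assumes distance is the difference between word positions and that the first word must precede the second; both are choices made for this example, and the name `within` is hypothetical.

```python
import re

def within(word_a, word_b, text, max_distance):
    """True if word_b occurs after word_a, at most max_distance word positions later."""
    words = re.findall(r"[a-z]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a]
    pos_b = [i for i, w in enumerate(words) if w == word_b]
    return any(0 < pb - pa <= max_distance for pa in pos_a for pb in pos_b)

text = "to enhance the power of retrieval"
close_enough = within("enhance", "retrieval", text, 4)
```

With this definition ‘enhance’ and ‘retrieval’ are exactly four positions apart in the sample text, so a limit of four accepts it and a limit of three does not.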

Page 8: Chapter 4 Query Languages

Boolean Queries

- The oldest form of combining keyword queries is to use Boolean operators
- A Boolean query has a syntax composed of
  - atoms (i.e. basic queries) that retrieve documents, and
  - Boolean operators which work on their operands (sets of documents)
- A query syntax tree can be defined
  - leaves are basic queries
  - internal nodes are operators

Page 9: Chapter 4 Query Languages

Boolean Queries (Cont.)

Query syntax tree (figure): AND at the root with children ‘translation’ and OR; OR has children ‘syntax’ and ‘syntactic’

- Retrieves all documents that contain the word ‘translation’ as well as either the word ‘syntax’ or the word ‘syntactic’
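
The syntax tree for this query can be evaluated bottom-up over sets of documents. The sketch below assumes a toy inverted index mapping each word to a set of document ids; the class names `Word`, `And`, `Or` and the index contents are illustrative, not taken from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    """Leaf: a basic query that retrieves the documents containing a term."""
    term: str

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

def evaluate(query, index):
    """Evaluate a query tree; operators work on their operands' document sets."""
    if isinstance(query, Word):
        return set(index.get(query.term, set()))
    left = evaluate(query.left, index)
    right = evaluate(query.right, index)
    return left & right if isinstance(query, And) else left | right

# Hypothetical inverted index: word -> ids of documents containing it.
index = {"translation": {1, 2, 4}, "syntax": {2, 3}, "syntactic": {4, 5}}
query = And(Word("translation"), Or(Word("syntax"), Word("syntactic")))
result = evaluate(query, index)
```

Documents 2 and 4 are returned: each contains ‘translation’ together with one of the two alternatives.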

Page 10: Chapter 4 Query Languages

Boolean Queries (Cont.)

- No ranking of the retrieved documents is provided
  - a document either satisfies the query (retrieved) or does not (not retrieved)
  - partial matching between a document and the user query is not allowed
- To overcome this limitation, a ‘fuzzy Boolean’ set of operators has been proposed
  - instead of requiring elements to appear in all the operands (AND) or in at least one of the operands (OR), retrieve elements appearing in some operands

Page 11: Chapter 4 Query Languages

Natural Language

- The distinction between AND and OR is completely blurred
- A query is simply an enumeration of words and context queries
- All documents matching a portion of the user query are retrieved
- Higher ranking is assigned to documents matching more parts of the query
- Any reference to Boolean operators is eliminated
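
Ranking by the number of matched query parts can be sketched as below. It assumes the same kind of toy inverted index (word to set of document ids) and treats each distinct query word as one "part"; the function name `rank_by_matches` is hypothetical.

```python
def rank_by_matches(query_words, index):
    """Retrieve documents matching any query word, ranked by how many words they match."""
    matched = {}
    for word in set(query_words):          # each distinct word counts once
        for doc_id in index.get(word, ()):
            matched[doc_id] = matched.get(doc_id, 0) + 1
    # More matched parts first; ties broken by document id.
    return sorted(matched.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical inverted index: word -> ids of documents containing it.
index = {"translation": {1, 2, 4}, "syntax": {2, 3}, "syntactic": {4, 5}}
ranking = rank_by_matches(["translation", "syntax"], index)
```

Document 2 matches both words and is ranked first; documents matching only one word follow.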

Page 12: Chapter 4 Query Languages

Pattern Matching

- Query formulation is based on the concept of a pattern, which allows retrieval of pieces of text that have some property
- A pattern is a set of syntactic features that occur in a text segment
- Segments satisfying the pattern specification are said to ‘match’ the pattern
- We are interested in documents containing segments that match a given search pattern

Page 13: Chapter 4 Query Languages

Pattern Matching (Cont.)

The most used types of pattern are

- words
  - a string (sequence of characters) that is a word in the text
- prefixes
  - a string that forms the beginning of a text word
  - the prefix ‘comput’ retrieves documents with words such as ‘computer’, ‘computation’
- suffixes
  - a string that forms the termination of a word
  - the suffix ‘ters’ retrieves documents with words such as ‘testers’, ‘computers’

Page 14: Chapter 4 Query Languages

- substrings
  - a string which can appear within a word
  - the substring ‘tal’ retrieves documents with words such as ‘coastal’, ‘talk’, ‘metallic’
- ranges
  - a pair of strings that matches any word lying between them in lexicographical order
  - the alphabet is sorted, which puts strings into lexicographical (dictionary) order
  - the range between the words ‘held’ and ‘hold’ retrieves strings such as ‘hoax’, ‘hissing’
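
The prefix, suffix, substring and range patterns can be tried on a small vocabulary. The sketch below uses plain Python string operations; the vocabulary is made up from the slides' examples, and treating the range as inclusive of both endpoints is an assumption.

```python
# Hypothetical vocabulary built from the example words.
vocabulary = [
    "coastal", "computation", "computer", "computers", "held",
    "hissing", "hoax", "hold", "metallic", "talk", "testers",
]

def with_prefix(prefix, words):
    return [w for w in words if w.startswith(prefix)]

def with_suffix(suffix, words):
    return [w for w in words if w.endswith(suffix)]

def with_substring(substring, words):
    return [w for w in words if substring in w]

def in_range(low, high, words):
    """Words lying between low and high in lexicographical order (inclusive here)."""
    return [w for w in words if low <= w <= high]

prefix_hits = with_prefix("comput", vocabulary)
suffix_hits = with_suffix("ters", vocabulary)
substring_hits = with_substring("tal", vocabulary)
range_hits = in_range("held", "hold", vocabulary)
```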

Page 15: Chapter 4 Query Languages

- allowing errors
  - a word together with an error threshold
  - retrieves all text words ‘similar’ to the given word
  - the pattern or the text may have errors (typing, spelling), and documents with erroneous variants of the word are retrieved (using edit distance)
  - if a typing error splits ‘flower’ into ‘flo wer’, it is still found with one error
- regular expressions (r.e.)
  - an r.e. is built up from simple strings and operators like union, concatenation and repetition
  - a query like ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’, where ε is the empty string, will match words like ‘problem02’, ‘proteins’
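
Searching while allowing errors is commonly based on edit distance. The sketch below uses the standard Levenshtein distance (insertions, deletions and substitutions each cost one); the slides do not say which variant is meant, so this choice, like the names `edit_distance` and `similar_words`, is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def similar_words(pattern, words, max_errors):
    """Words whose edit distance to the pattern is within the error threshold."""
    return [w for w in words if edit_distance(pattern, w) <= max_errors]

split_error = edit_distance("flower", "flo wer")
hits = similar_words("flower", ["flower", "flowers", "floor", "tower"], 1)
```

The inserted space in ‘flo wer’ counts as a single error, so a threshold of one still finds it.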

Page 16: Chapter 4 Query Languages

- extended patterns
  - a subset of regular expressions expressed with a simpler syntax
  - classes of characters, i.e. some positions in the pattern are matched by any character from a pre-defined set (e.g. a character must be a digit, not a letter, a vowel, etc.)
  - conditional expressions, i.e. part of the pattern may or may not appear
  - wild characters, i.e. match any sequence in the text (e.g. any word that starts with ‘flo’ and ends with ‘ers’, which matches ‘flowers’ as well as ‘flounders’)
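
Regular expressions and the extended patterns can be written with Python's `re` module, as in the sketch below. The first pattern assumes the regular-expression example is meant to match ‘problem02’ and ‘proteins’ (hence ‘blem’); the other three patterns and the helper `matching_words` are invented for illustration.

```python
import re

# Union, concatenation and repetition; "(s|)" means "s" or the empty string.
REGULAR_EXPRESSION = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")
# Wild characters: any sequence between a fixed start and a fixed end.
WILD = re.compile(r"flo.*ers")
# Classes of characters: one vowel followed by one digit.
VOWEL_THEN_DIGIT = re.compile(r"[aeiou][0-9]")
# Conditional expression: the "u" may or may not appear.
OPTIONAL_PART = re.compile(r"colou?r")

def matching_words(pattern, words):
    """Return the words that match the compiled pattern in full."""
    return [w for w in words if pattern.fullmatch(w)]

regex_hits = matching_words(REGULAR_EXPRESSION, ["problem02", "proteins", "program"])
wild_hits = matching_words(WILD, ["flowers", "flounders", "floors"])
```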

Page 17: Chapter 4 Query Languages

Structural Queries

- Allow the user to query texts based on their structure, and not only their content
- Mixing contents and structure in queries
  - makes it possible to pose powerful queries (much more expressive)
- An example
  - select a set of documents that satisfy certain constraints on content (using words, phrases, or patterns), and then
  - apply structural constraints, expressed using containment, proximity, or other restrictions on the structural elements (e.g. chapters, sections) present in the documents

Page 18: Chapter 4 Query Languages

Types of structures

fixed structure

hypertext

hierarchical structure

Page 19: Chapter 4 Query Languages

Fixed Structure

- A document has a fixed set of fields
  - each field has some text inside
  - some fields are not present in all documents
  - fields are not allowed to nest or overlap
- Retrieval activity is restricted to specifying that a given pattern is to be found only in a given field
- This model is reasonable when the text collection has a fixed structure
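
Restricting a pattern to one field can be sketched with documents stored as flat dictionaries. The field names, sample records and the function `search_field` are assumptions for the example; a missing field simply never matches.

```python
def search_field(documents, field, pattern):
    """Return documents whose given field contains the pattern (case-insensitive)."""
    needle = pattern.lower()
    return [doc for doc in documents if needle in doc.get(field, "").lower()]

# Hypothetical mail-like collection; the second record has no "subject" field.
documents = [
    {"id": 1, "sender": "alice", "subject": "Query languages", "body": "see chapter 4"},
    {"id": 2, "sender": "bob", "body": "notes on query languages"},
    {"id": 3, "sender": "carol", "subject": "Lunch", "body": "noon?"},
]
subject_hits = [doc["id"] for doc in search_field(documents, "subject", "query")]
body_hits = [doc["id"] for doc in search_field(documents, "body", "query")]
```

The same word gives different answers depending on the field searched, which is the whole expressive power of this model.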

Page 20: Chapter 4 Query Languages

Hypertext

- Retrieval from hypertext began as a navigational activity
  - the user manually traverses hypertext nodes, following links to search for what they want
  - it was not possible to query the hypertext based on its structure
- WebGlimpse is an interesting proposal
  - allows navigation plus the ability to search by content in the neighborhood of the current node
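
Searching by content in the neighborhood of the current node can be illustrated with a breadth-first traversal limited to a few link hops. This is only a sketch of the idea and not how WebGlimpse itself is implemented; the graph, the page contents and the function `search_neighborhood` are hypothetical.

```python
from collections import deque

def search_neighborhood(start, links, contents, word, max_hops):
    """Return nodes within max_hops links of start whose text contains the word."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, depth = queue.popleft()
        if word in contents.get(node, "").lower().split():
            hits.append(node)
        if depth < max_hops:                      # do not expand beyond the neighborhood
            for target in links.get(node, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return hits

# Hypothetical hypertext: node -> outgoing links, and node -> text.
links = {"home": ["a", "b"], "a": ["c"], "b": [], "c": []}
contents = {"home": "welcome", "a": "query languages",
            "b": "retrieval models", "c": "query trees"}
one_hop = search_neighborhood("home", links, contents, "query", 1)
two_hops = search_neighborhood("home", links, contents, "query", 2)
```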

Page 21: Chapter 4 Query Languages

Hierarchical Structure

- Represents a recursive decomposition of the text
- A natural model for many text collections
- Figure 4.3 shows an example of a hierarchical structure, consisting of a page of a book, its schematic view, and a parsed query to retrieve the figure

Page 22: Chapter 4 Query Languages

Hierarchical Models: PAT Expressions

- Structure is marked in the text by tags (as in HTML)
  - defined in terms of initial and final tags
- Each pair of initial and final tags defines a region, a set of contiguous text areas
  - the areas of a region cannot nest or overlap
- It is possible to select areas containing other areas, contained in other areas, or followed by other areas
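
The three selections named above can be sketched by representing each area as a `(start, end)` pair of text positions. This representation, the inclusive containment test and the function names are assumptions made for the example, not the PAT system's own interface.

```python
def containing(outer, inner):
    """Areas of `outer` that contain at least one area of `inner`."""
    return [o for o in outer if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]

def contained_in(inner, outer):
    """Areas of `inner` that lie inside at least one area of `outer`."""
    return [i for i in inner if any(o[0] <= i[0] and i[1] <= o[1] for o in outer)]

def followed_by(first, second):
    """Areas of `first` that end before some area of `second` begins."""
    return [f for f in first if any(f[1] < s[0] for s in second)]

# Hypothetical regions: two chapter areas and one figure area.
chapters = [(0, 100), (101, 200)]
figures = [(120, 130)]
chapters_with_figures = containing(chapters, figures)
figures_in_chapters = contained_in(figures, chapters)
chapters_before_figure = followed_by(chapters, figures)
```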

Page 23: Chapter 4 Query Languages

Overlapped Lists

- Allows the areas of a region to overlap, but not to nest
- Considers the use of an inverted list where both words and regions are indexed
- Allows set union to be performed, and regions to be combined

Page 24: Chapter 4 Query Languages

List of References

- An attempt to make the definition and querying of structured text uniform, using a common language
- The language allows querying on ‘path expressions’, which describe paths in the structure tree
- Answers to queries are lists of ‘references’
  - a reference is a pointer to a region of the database
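
A path expression evaluated over a structure tree, returning references to regions, might be sketched as follows. The slash-separated path syntax, the `Node` class and the `follow` function are invented for illustration; the slides do not specify the model's actual language.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A structural element covering text positions start..end."""
    name: str
    start: int
    end: int
    children: list = field(default_factory=list)

def follow(root, path):
    """Return (start, end) references for nodes reached by a path like 'book/chapter'."""
    steps = path.split("/")
    if root.name != steps[0]:
        return []
    frontier = [root]
    for step in steps[1:]:
        # Move one level down, keeping only children with the requested name.
        frontier = [c for node in frontier for c in node.children if c.name == step]
    return [(node.start, node.end) for node in frontier]

# Hypothetical structure tree of a small book.
book = Node("book", 0, 300, [
    Node("chapter", 0, 100, [Node("section", 10, 50)]),
    Node("chapter", 101, 300, [Node("section", 110, 200), Node("figure", 210, 220)]),
])
sections = follow(book, "book/chapter/section")
```

The answer is a list of pointers into the text rather than the text itself, matching the idea of references.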

Page 25: Chapter 4 Query Languages

Proximal Nodes

- Tries to find a good compromise between expressiveness and efficiency
- Specifies a fully compositional language where
  - leaves of the query syntax tree are formed by basic queries on contents or names of structural elements (e.g. all chapters)
  - internal nodes combine results
- For efficiency, operations at internal nodes must relate nodes that are close in the text

Page 26: Chapter 4 Query Languages

Tree Matching

- Relies on a single primitive: tree inclusion
- Interprets the structure of the text database and the query as trees
- Determines an embedding of the query into the database which respects the hierarchical relationships between nodes of the query
- Simple queries return the roots of the matches
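
The flavor of tree inclusion can be conveyed with a deliberately relaxed sketch: a query node matches a database node with the same label if each query child can be matched somewhere among that node's descendants. Unlike full tree inclusion, this version does not force distinct query nodes onto distinct database nodes and ignores sibling order; the tuple representation `(label, children)` and all function names are assumptions.

```python
def descendants(node):
    """Yield every node strictly below `node`; a node is a (label, children) tuple."""
    for child in node[1]:
        yield child
        yield from descendants(child)

def embeds_at(query, node):
    """True if the query tree can be embedded with its root mapped to `node`."""
    if query[0] != node[0]:
        return False
    # Each query child must embed at some descendant, preserving ancestorship.
    return all(any(embeds_at(qc, d) for d in descendants(node)) for qc in query[1])

def match_roots(query, database):
    """Return the database nodes that are roots of a match."""
    candidates = [database, *descendants(database)]
    return [n for n in candidates if embeds_at(query, n)]

# Hypothetical database tree: only the first chapter contains a figure.
database = ("book", [
    ("chapter", [("section", [("figure", [])])]),
    ("chapter", [("section", [])]),
])
query = ("chapter", [("figure", [])])
roots = match_roots(query, database)
```

The figure sits below a section rather than directly below the chapter, and the match still succeeds because only the ancestor relationship has to be respected.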