sdpl 2001notes 8:1 region algebra1 8. querying structured text n how to query the structure and...

45
SDPL 2001 Notes 8:1 Region Algebra 1 8. Querying Structured 8. Querying Structured Text Text How to query the structure and contents How to query the structure and contents of structured documents? of structured documents? 8.1 Text-region algebra, and search tool sgrep 8.1 Text-region algebra, and search tool sgrep » a simple model and tool for a simple model and tool for content retrieval content retrieval of arbitrary structured text files, based on of arbitrary structured text files, based on concrete document syntax concrete document syntax 8.2 W3C XML query language XQuery 8.2 W3C XML query language XQuery » a rich a rich query and data manipulation language query and data manipulation language for for all types of XML data sources, based on all types of XML data sources, based on conceptual content of XML documents ("XML conceptual content of XML documents ("XML Information Set") Information Set")

Upload: diana-osborne

Post on 26-Dec-2015

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 1

8. Querying Structured Text8. Querying Structured Text

How to query the structure and contents of How to query the structure and contents of structured documents?structured documents?8.1 Text-region algebra, and search tool sgrep8.1 Text-region algebra, and search tool sgrep

» a simple model and tool for a simple model and tool for content retrievalcontent retrieval of arbitrary of arbitrary structured text files, based on concrete document syntaxstructured text files, based on concrete document syntax

8.2 W3C XML query language XQuery8.2 W3C XML query language XQuery » a rich a rich query and data manipulation languagequery and data manipulation language for all for all

types of XML data sources, based on conceptual content types of XML data sources, based on conceptual content of XML documents ("XML Information Set")of XML documents ("XML Information Set")

Page 2: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 2

8.1 Region algebra (and sgrep)8.1 Region algebra (and sgrep)

the model: the model: region algebraregion algebra– relatively low-level but appropriate for e.g. XMLrelatively low-level but appropriate for e.g. XML– documents seen as contiguous portions called documents seen as contiguous portions called regionsregions

an implementation: an implementation: sgrepsgrep– ““structured grep”, command-line based search toolstructured grep”, command-line based search tool

» version 0.99 (Unix/Linux): basic featuresversion 0.99 (Unix/Linux): basic features» version 1.92a (Unix/Linux/Win32): indexing, version 1.92a (Unix/Linux/Win32): indexing,

SGML/XML/HTML support, other additional featuresSGML/XML/HTML support, other additional features– extracts portions of files/input streams extracts portions of files/input streams

Page 3: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 3

Region algebra: BackgroundRegion algebra: Background

PatPatTMTM (University of Waterloo; OpenText) (University of Waterloo; OpenText)– efficient full-text index (suffix array)efficient full-text index (suffix array)– match-point delimited regionsmatch-point delimited regions

» result sets of non-overlapping regions only; result sets of non-overlapping regions only; overlapping results represented as their start points overlapping results represented as their start points semantic problems semantic problems

‘‘‘‘generalized concordance lists’’ generalized concordance lists’’ – Clarke, Cormack and Burkowski 1995Clarke, Cormack and Burkowski 1995– overlaps but no nesting in resultsoverlaps but no nesting in results

Page 4: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 4

Nested region algebraNested region algebra

Jaakkola & Kilpeläinen, U. of Helsinki 1995-1999Jaakkola & Kilpeläinen, U. of Helsinki 1995-1999 retrieval of document components as dynamically retrieval of document components as dynamically

defined regions, based on their mutual ordering defined regions, based on their mutual ordering and nestingand nesting

no restrictions on the length, overlapping or no restrictions on the length, overlapping or nesting of regionsnesting of regions

no restrictions on document formatsno restrictions on document formats– regular markup formats (a’la XML) most appropriateregular markup formats (a’la XML) most appropriate

Page 5: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 5

Basics of region algebra data modelBasics of region algebra data model

Set of Set of positionspositions U= U= {0, …, {0, …, nn-1} comprising a text -1} comprising a text of length of length nn– characters/bytes (sgrep), or wordscharacters/bytes (sgrep), or words

RegionsRegions: : contiguous (non-empty) sequences of contiguous (non-empty) sequences of positions (normally fragments of files)positions (normally fragments of files)– typically occurrences of a string,typically occurrences of a string,

or document elements or document elements– regionregion a a = { = {s, ss, s+1+1…, e…, e} denoted by (} denoted by (s, es, e) )

or (or (a.s, a.ea.s, a.e), by the ), by the startstart and and end positionend position of region of region aa

Page 6: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 6

Positions in sgrepPositions in sgrep

Sample file (under Linux):Sample file (under Linux):% cat abra.txt% cat abra.txtabraabracadcadabraabra$$

Regions normally occurrences of a string:Regions normally occurrences of a string:% sgrep -o"(%s,%e)" '% sgrep -o"(%s,%e)" '"abra""abra"' abra.txt' abra.txt(0,3)(7,10)(0,3)(7,10)– output format template output format template "(%s,%e)" "(%s,%e)" applied to each applied to each

result region ("insert start position and end position in result region ("insert start position and end position in parentheses")parentheses")

Page 7: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 7

Positions in sgrep (2)Positions in sgrep (2)

The first position:The first position:% sgrep '% sgrep 'startstart' abra.txt' abra.txtaa

The last position:The last position:% sgrep '% sgrep 'endend' abra.txt' abra.txt$$

The entire document (more on operator 'The entire document (more on operator '....'' later):later):% sgrep 'start % sgrep 'start .... end' abra.txt end' abra.txtabracadabra$abracadabra$

Page 8: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 8

Relationships of regions Relationships of regions aa and and bb

Language based on relationships between Language based on relationships between regions:regions:

1)1) a a precedesprecedes b, b b, b followsfollows a a::– a a ends beforeends before b b starts,starts, a.e < b.s a.e < b.s

2) 2) a a is is included inincluded in b b ( (b b containscontains a): a a): a bb ( (b b aa):):– b.s b.s a.sa.s, and , and a.e a.e b.eb.e– proper containmentproper containment: : a a bb, and , and b b aa

3) if neither of the above, a and b 3) if neither of the above, a and b overlapoverlap

Page 9: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 9

Region algebraRegion algebra

Region algebra is a set-valued languageRegion algebra is a set-valued language– the value of any expression is a set of regionsthe value of any expression is a set of regions– c.f. relational algebra: set of rows/tuplesc.f. relational algebra: set of rows/tuples

Operations map Operations map region setsregion sets to to region sets region sets a a compositional compositional language: operands of language: operands of expressions can be any other expressionsexpressions can be any other expressions

arbitrarily complex queries can be formed arbitrarily complex queries can be formed

Page 10: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 10

Building new regionsBuilding new regions

Nested regions (e.g. document elements)Nested regions (e.g. document elements)– ‘‘‘‘followed-by’’ operatorsfollowed-by’’ operators

Non-nested regionsNon-nested regions– ‘‘‘‘quote’’ operatorsquote’’ operators

By removing overlap with other regionsBy removing overlap with other regions– ‘‘‘‘extracting’’ operatorextracting’’ operator

Page 11: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 11

‘‘‘‘Followed-by’’Followed-by’’

The most central and characteristic operator in The most central and characteristic operator in nested region algebra and sgrepnested region algebra and sgrep

allows dynamic creation of regions from their allows dynamic creation of regions from their bounding regionsbounding regions

generalises the way how parentheses are generalises the way how parentheses are paired together, starting from inside and paired together, starting from inside and always matching the closest unmatched onesalways matching the closest unmatched ones

Page 12: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 12

Followed-by exampleFollowed-by example

Consider a parenthesised expression:Consider a parenthesised expression:% cat expr.txt% cat expr.txt(1(2(3))(4))(1(2(3))(4))

Extract parenthesised sub-expressions:Extract parenthesised sub-expressions:% sgrep -o"%r% sgrep -o"%r////" '"(" " '"(" .... ")"' expr.txt ")"' expr.txt(1(2(3))(4))(1(2(3))(4))////(2(3))(2(3))////(3)(3)////(4)(4)////

NB: by default, without theNB: by default, without the -o-o output format output format specification, sgrep outputs each position covered specification, sgrep outputs each position covered by result regions only onceby result regions only once

Page 13: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 13

Value of Value of A .. BA .. B formally formally

A A andand B B sets of regionssets of regions AA .. .. BB = {( = {(a.s, b.ea.s, b.e)} )} UU ( ( ( (AA-{-{aa}) .. (}) .. (BB-{-{bb}) ),}) ),

where where a a A, b A, b BB such that such that a.e < b.sa.e < b.s and the and the a.e .. b.sa.e .. b.s distance is minimal distance is minimal – (and also distance btw (and also distance btw a.s a.s andand b.e b.e is minimal, if there is minimal, if there

are otherwise equidistant pairs)are otherwise equidistant pairs) – empty set, if there are no such empty set, if there are no such a a A A andand b b B B

each each a a A A and eachand each b b B B produces at most produces at most one result region (one result region (a.s, b.ea.s, b.e))

Page 14: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 14

Regions of document elementsRegions of document elements

Regions delimited by start and end tags:Regions delimited by start and end tags:% sgrep '"<TITLE>" .. "</TITLE>"' sgrepman.html% sgrep '"<TITLE>" .. "</TITLE>"' sgrepman.html<TITLE> sgrep - Manual page </TITLE><TITLE> sgrep - Manual page </TITLE>

Delimiting regions can be left-out:Delimiting regions can be left-out:– using also the built-in markup scanner to recognise tags: using also the built-in markup scanner to recognise tags:

% sgrep '% sgrep 'stag("TITLE")stag("TITLE") _._. etag("TITLE")etag("TITLE")' sgrepman.html ' sgrepman.html sgrep - Manual page </TITLE> sgrep - Manual page </TITLE>% sgrep 'stag("TITLE") % sgrep 'stag("TITLE") ._._ etag("TITLE")' sgrepman.html etag("TITLE")' sgrepman.html <TITLE> sgrep - Manual page <TITLE> sgrep - Manual page% sgrep 'stag("TITLE") % sgrep 'stag("TITLE") ____ etag("TITLE")' sgrepman.html etag("TITLE")' sgrepman.html sgrep - Manual page sgrep - Manual page

(possible empty regions are excluded from (possible empty regions are excluded from A __ BA __ B) )

Page 15: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 15

‘‘‘‘Quote’’ operatorsQuote’’ operators

Documents typically contain also disjoint regions, Documents typically contain also disjoint regions, often delimited by identical start and end markersoften delimited by identical start and end markers– e.g. string constants in programming languagese.g. string constants in programming languages

% cat hello.txt% cat hello.txt"Hello," said Bob, "nice day! ""Hello," said Bob, "nice day! "% sgrep '"\"" % sgrep '"\"" quotequote "\""' hello.txt "\""' hello.txt "Hello,""nice day!" "Hello,""nice day!"

– starting or ending regions, or both, can be excluded starting or ending regions, or both, can be excluded (similarly to(similarly to _., ._ _., ._ and and ____) using) using _quote_quote, , quote_quote_ and and _quote__quote_

Page 16: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 16

Extracting overlapping regionsExtracting overlapping regions

A A extractingextracting B B = regions that result by removing = regions that result by removing from regions of from regions of AA any overlap with regions in any overlap with regions in BB

Contents of sections without subsections:Contents of sections without subsections:

sgrep '"<sect>" .. "</sect>" sgrep '"<sect>" .. "</sect>" extractingextracting ("<subsect>" .. "</subsect>")' ex.xml("<subsect>" .. "</subsect>")' ex.xml

Content without markup tags:Content without markup tags:

sgrep 'start .. end sgrep 'start .. end extractingextracting ("<" ..">")' *.html("<" ..">")' *.html

Note: No operator precedence; Parenthesised sub-Note: No operator precedence; Parenthesised sub-expressions are evaluated first, otherwise left-to-rightexpressions are evaluated first, otherwise left-to-right

Page 17: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 17

Containment conditionsContainment conditions

Allow selecting regions based on their Allow selecting regions based on their context or contentcontext or content

Operators for selecting regions that Operators for selecting regions that – appear / do not appear in a given contextappear / do not appear in a given context– contain / do not contain a region of another setcontain / do not contain a region of another set

Similar operators in other variations of Similar operators in other variations of region algebraregion algebra

Page 18: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 18

Containment operators formallyContainment operators formally

A A inin B: { B: {aaAA b b BB:: a a b b

A A not innot in B: { B: {aaAA b b BB:: a a b b

A A containingcontaining B: { B: {aaAA b b BB:: a a b b

A A not containingnot containing B: { B: {aaAA b b BB:: a abb

Page 19: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 19

Set operationsSet operations

Union, intersection and difference:Union, intersection and difference: A A oror B, A B, A equalequal B, A B, A not equalnot equal B B

Rather seldom needed (except for ‘Rather seldom needed (except for ‘oror’);’);containment conditions otherwise sufficient containment conditions otherwise sufficient for expressing Boolean retrieval for expressing Boolean retrieval (See next)(See next)

Page 20: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 20

Expressing Boolean queriesExpressing Boolean queries

HTML files (HTML files (%f%f) containing “cat” AND “dog”:) containing “cat” AND “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end

containingcontaining "cat" "cat" containingcontaining "dog"' *.html "dog"' *.html

HTML files containing “cat” or “dog”:HTML files containing “cat” or “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end

containingcontaining ("cat" ("cat" oror "dog")' *.html "dog")' *.html

HTML files containing “cat” but no “dog”:HTML files containing “cat” but no “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end containingcontaining "cat" "cat" not containingnot containing "dog"' *.html "dog"' *.html

Page 21: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 21

Structural retrievalStructural retrieval

Arbitrarily complex queries, e.g., ‘‘extract titles of Arbitrarily complex queries, e.g., ‘‘extract titles of pages which mention cats, but not in the title of the pages which mention cats, but not in the title of the page’’:page’’: sgrep '"<TITLE>" .. "</TITLE>" in sgrep '"<TITLE>" .. "</TITLE>" in (start .. end containing (start .. end containing ("cat" not in ("<TITLE>" .. "</TITLE>"))' ("cat" not in ("<TITLE>" .. "</TITLE>"))' *.html *.html

But notice: The model supports (and sgrep implements) But notice: The model supports (and sgrep implements) only extracting the regions that satisfy the query, in orderonly extracting the regions that satisfy the query, in order– with restricted modification of each region (usingwith restricted modification of each region (using -o) -o)

Page 22: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 22

Additional operations on region setsAdditional operations on region sets

concatconcat(A): minimal set of regions covering (A): minimal set of regions covering exactly the regions of Aexactly the regions of A– default result formatting of sgrepdefault result formatting of sgrep

innerinner(A) = A not containing A(A) = A not containing A– the innermost regions in Athe innermost regions in A

outerouter(A) = A not in A(A) = A not in A– the outermost regions in Athe outermost regions in A– e.g. to get the document root element:e.g. to get the document root element:

outerouter((elementselements))

Page 23: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 23

Sgrep: some useful optionsSgrep: some useful options

-a-a: Display result regions surrounded by the rest of : Display result regions surrounded by the rest of the file; useful with the file; useful with -o-o (below) (below)

-o-o <style><style>: Set output format string, possibly : Set output format string, possibly containing following place holders:containing following place holders:– %f%f: name of the file containing the start of region,: name of the file containing the start of region,– %s, %e%s, %e: start and end position,: start and end position,– %r%r: the content (text) of the region,: the content (text) of the region,– %n%n: the ordinal number of the region, (+ few others); : the ordinal number of the region, (+ few others);

Output once for each result region, with current values Output once for each result region, with current values substituted for place holderssubstituted for place holders

Page 24: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 24

Sgrep: more useful optionsSgrep: more useful options

-c -c display only count of matching regions display only count of matching regions -h-h a short help (list of options) a short help (list of options) -f file-f file read commands from read commands from filefile -i -i ignore case distinctions in phrases ignore case distinctions in phrases -S-S stream mode (regions extend across files) stream mode (regions extend across files) -T/-t-T/-t show statistics about things done/time show statistics about things done/time

spentspent For more, see the man page and READMEFor more, see the man page and README

Page 25: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 25

Applications of sgrepApplications of sgrep

Simple document assembly Simple document assembly – Jaakkola and Kilpeläinen: Using sgrep for querying Jaakkola and Kilpeläinen: Using sgrep for querying

structured text files, structured text files, SGML Finland’96SGML Finland’96

Analysing element structures Analysing element structures As a Web site search engine As a Web site search engine

– E.g. Chapter 7 in Leventhal, Lewis & Fuchs: E.g. Chapter 7 in Leventhal, Lewis & Fuchs: Designing Designing XML Internet ApplicationsXML Internet Applications

– Using sgrep indexer to speed up querying static filesUsing sgrep indexer to speed up querying static files

Page 26: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 26

An assembly system prototypeAn assembly system prototype

Page 27: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 27

Analysing element structuresAnalysing element structures

Structure of element instances in a collection? Structure of element instances in a collection? The DTD does not tell!The DTD does not tell!

Element statistics:Element statistics: lu 1751 (min: 67, avg: 6849.37, max: 115620)lu 1751 (min: 67, avg: 6849.37, max: 115620) 41 -> 98 huom 41 -> 98 huom 1751 -> 1751 nu 1751 -> 1751 nu 1748 -> 1748 ot 1748 -> 1748 ot

1751 1751 lulu elements, with shown minimum, average and elements, with shown minimum, average and maximum lenghts; 41 contain maximum lenghts; 41 contain huomhuom elements directly, and elements directly, and 98 98 huomhuom elements are children ofelements are children of lulu elementselements..

Page 28: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 28

Structure analysis: implementation Structure analysis: implementation

Generated by a Python script Generated by a Python script each datum (element type name, count, each datum (element type name, count,

length) computed by generating and length) computed by generating and executing an sgrep queryexecuting an sgrep query

Page 29: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 29

Implementation of sgrepImplementation of sgrep

Steps in executing an sgrep query?Steps in executing an sgrep query?1. Preprocessing 1. Preprocessing

» macro expansion; external & optionalmacro expansion; external & optional

2. Query parsing2. Query parsing

3. Query optimisation3. Query optimisation

4. String retrieval on the text4. String retrieval on the text

5. Operator evaluation5. Operator evaluation

6. Data delivery6. Data delivery» Outputting of result regionsOutputting of result regions

Page 30: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 30

1. Preprocessing 1. Preprocessing

Consider the below query Q:Consider the below query Q:sgrep -f macros -e'outer(S in S)' a.xml b.xmlsgrep -f macros -e'outer(S in S)' a.xml b.xml

With suitable macro definitions Q becomes With suitable macro definitions Q becomes outer(("<sec>" .. "</sec>") in outer(("<sec>" .. "</sec>") in

("<sec>" .. "</sec>"))("<sec>" .. "</sec>"))

m 4 preprocessormacros

expanded query

Q

Page 31: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 31

2. Query parsing2. Query parsing

– Query into an Query into an operator tree:operator tree:» operators as internal nodes, string phrases as operators as internal nodes, string phrases as

leavesleaves

outerouter

inin

"<sec>""<sec>"

....

"</sec>""</sec>" "<sec>""<sec>"

....

"</sec>""</sec>"

Page 32: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 32

3. Query optimisation 3. Query optimisation

– common subexpression eliminationcommon subexpression elimination– operator tree to an operator tree to an operator graphoperator graph (DAG) (DAG)

» e.g, an operator tree of 735 nodes (for BibTeX e.g, an operator tree of 735 nodes (for BibTeX records) reduces to a DAG of 103 nodesrecords) reduces to a DAG of 103 nodes

outerouter

inin

"<sec>""<sec>"

....

"</sec>""</sec>"

Page 33: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 33

4. String retrieval (1)4. String retrieval (1)

– A deterministic Aho-Corasick automaton A deterministic Aho-Corasick automaton M M built of string patterns in the querybuilt of string patterns in the query

M:

>s e c

>s e c<

/

Page 34: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 34

String retrieval (2)String retrieval (2)

– Input files (or streams) scanned using Input files (or streams) scanned using MM» each pattern simultaneouslyeach pattern simultaneously» region list region list of pattern occurrences attached to of pattern occurrences attached to

the leaves of the operator graphthe leaves of the operator graphouterouter

inin

"<sec>""<sec>"

....

"</sec>""</sec>"[(5,9), (26,30), …][(5,9), (26,30), …] [(20,25), (100,105), …][(20,25), (100,105), …]

Alternatively, region lists can be Alternatively, region lists can be obtained by a look-up from a obtained by a look-up from a pre-computed index (sgrep pre-computed index (sgrep 1.92), without scanning target 1.92), without scanning target filesfiles

Page 35: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 35

5. Operator evaluation5. Operator evaluation

Bottom-up traversal of the operator graph Bottom-up traversal of the operator graph – value of node: a value of node: a region listregion list

» of (start, end) index pairsof (start, end) index pairs» internal representation of region setsinternal representation of region sets» maintained in increasing (start, end) ordermaintained in increasing (start, end) order

– each node evaluated onceeach node evaluated once– (sub)queries with large results may require lots (sub)queries with large results may require lots

of main memory!of main memory!

Page 36: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 36

Operator graph evaluationOperator graph evaluation

outerouter

inin

"<sec>""<sec>"

....

"</sec>""</sec>"

[(5,9), (26,30), …][(5,9), (26,30), …] [(20,25), (100,105), …][(20,25), (100,105), …]

[(5,25), (26,105), …][(5,25), (26,105), …]

[(1000,1500), (1010,1200), …][(1000,1500), (1010,1200), …]

[(1000,1500), (1800,1923), …][(1000,1500), (1800,1923), …]

Page 37: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 37

6. Data delivery6. Data delivery

Finally, the regions in the result region list Finally, the regions in the result region list are output in document orderare output in document order– text of regions retrieved by indexing target files text of regions retrieved by indexing target files

by start and end positions of the result regionsby start and end positions of the result regions– result regions possibly modified according to result regions possibly modified according to

the output style specification (given with option the output style specification (given with option -o-o))

Page 38: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 38

Evaluating region algebra operatorsEvaluating region algebra operators

Synchronised (merge-like, linear) traversal Synchronised (merge-like, linear) traversal of operand lists, of operand lists, except that ...except that ...– A A .... B B requires sorting of requires sorting of AA (by region end (by region end

positions) if positions) if AA contains nested regions contains nested regions– A A extractingextracting B B may require considering the may require considering the

same regions of A multiple times same regions of A multiple times

Page 39: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 39

Evaluation complexityEvaluation complexity

Worst-case time Worst-case time OO((t nt n), where ), where nn maximum maximum size and size and t t maximum ''thickness'' (number of maximum ''thickness'' (number of regions overlapping at any position) of any regions overlapping at any position) of any region setregion set– Jaakkola and Kilpeläinen: Nested Text-Region Jaakkola and Kilpeläinen: Nested Text-Region

Algebra, January 1999.Algebra, January 1999. In practise linear time;In practise linear time; comparable to Unix grepscomparable to Unix greps

Page 40: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 40

Extended features in sgrep 2Extended features in sgrep 2

(pre-release v. 1.92a)(pre-release v. 1.92a) documented (only!) in the README filedocumented (only!) in the README file XML/HTML/SGML supportXML/HTML/SGML support

– scanner to recognise markup tokensscanner to recognise markup tokens » SGML/HTML-mode (default)SGML/HTML-mode (default)

markup names converted to UPPER CASE markup names converted to UPPER CASE

» XML-mode: case-sensitive retrieval of tag namesXML-mode: case-sensitive retrieval of tag names– simple-minded parser to recognise elementssimple-minded parser to recognise elements– 16-bit wide characters in XML documents16-bit wide characters in XML documents– NBNB: currently no expanding of entity references; : currently no expanding of entity references;

No validation or well-formedness checkingNo validation or well-formedness checking

Page 41: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 41

Extended sgrep features (2)Extended sgrep features (2)

Direct containment:Direct containment:– A A childreningchildrening B, A B, A parentingparenting B B

Restricting the number of result regionsRestricting the number of result regions– first(n, E), last(n, E) first(n, E), last(n, E)

Restricting the length of result regionsRestricting the length of result regions– first_bytes(n, E), last_bytes(n, E)first_bytes(n, E), last_bytes(n, E)

Nearness operatorsNearness operators– A A near(n)near(n) B, A B, A near_before(n)near_before(n) B B

Indexing of both structure and contentIndexing of both structure and content

Page 42: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 42

Restricting result regions Restricting result regions

Display first 30 bytes of first 3 paragraphs:Display first 30 bytes of first 3 paragraphs:

turni> sgrep -o"%r ...\n" -g xml \turni> sgrep -o"%r ...\n" -g xml \''first_bytesfirst_bytes(30, (30, firstfirst(3, stag("p") ..(3, stag("p") .. etag("p")))' REC-xml-19980210.xml etag("p")))' REC-xml-19980210.xml<p>The Extensible Markup Langua...<p>The Extensible Markup Langua...<p>This document has been revie...<p>This document has been revie...<p><p>This document specifies a s...This document specifies a s...

Page 43: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 43

Restricting results (2)Restricting results (2)

Get end tags of the children of the document Get end tags of the children of the document element:element:turni> sgrep -o"%r\n" -g xml \turni> sgrep -o"%r\n" -g xml \'etag("*") containing 'etag("*") containing last_byteslast_bytes(1, (1,

outer(outer(elementselements in in elementselements))' \))' \REC-xml-19980210.xmlREC-xml-19980210.xml

</header></header></body></body>

</back></back>

Page 44: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 44

Indexing in sgrep 2.0Indexing in sgrep 2.0

Both the structure and content (words) indexed Both the structure and content (words) indexed (max file size 2 GB)(max file size 2 GB)

Creates a separate index (Creates a separate index (postings filepostings file) ) – terms with region lists of their occurrencesterms with region lists of their occurrences– index a compressed binary file, size 30-60% of the index a compressed binary file, size 30-60% of the

original filesoriginal files

Increases the efficiency of accessing static Increases the efficiency of accessing static document collections considerablydocument collections considerably– see the next examplesee the next example

Page 45: SDPL 2001Notes 8:1 Region Algebra1 8. Querying Structured Text n How to query the structure and contents of structured documents? 8.1 Text-region algebra,

SDPL 2001 Notes 8:1 Region Algebra 45

An indexing exampleAn indexing example

Index a 64-fold copy of the XML Rec Index a 64-fold copy of the XML Rec ((S64S64, 10 MB), 10 MB)::> sgrep > sgrep -I-I -g xml -g xml -c S64.index-c S64.index S64 S64> ls -l S64.index> ls -l S64.index…… 3621954 Feb 8 21:01 S64.index3621954 Feb 8 21:01 S64.index

How often word "Bray" occurs in file How often word "Bray" occurs in file S64S64??> time sgrep -c 'word("Bray")' S64> time sgrep -c 'word("Bray")' S643203203.80user 0.07system 3.80user 0.07system 0:03.870:03.87elapsed elapsed

The same using the index:The same using the index:> time sgrep -c > time sgrep -c -x-x S64.indexS64.index 'word("Bray")' 'word("Bray")'3203200.02user 0.01system 0.02user 0.01system 0:00.030:00.03elapsedelapsed

A 100-fold speed-up!A 100-fold speed-up!