sdpl 2001notes 8:1 region algebra1 8. querying structured text n how to query the structure and...
TRANSCRIPT
SDPL 2001 Notes 8:1 Region Algebra 1
8. Querying Structured Text8. Querying Structured Text
How to query the structure and contents of How to query the structure and contents of structured documents?structured documents?8.1 Text-region algebra, and search tool sgrep8.1 Text-region algebra, and search tool sgrep
» a simple model and tool for a simple model and tool for content retrievalcontent retrieval of arbitrary of arbitrary structured text files, based on concrete document syntaxstructured text files, based on concrete document syntax
8.2 W3C XML query language XQuery8.2 W3C XML query language XQuery » a rich a rich query and data manipulation languagequery and data manipulation language for all for all
types of XML data sources, based on conceptual content types of XML data sources, based on conceptual content of XML documents ("XML Information Set")of XML documents ("XML Information Set")
SDPL 2001 Notes 8:1 Region Algebra 2
8.1 Region algebra (and sgrep)8.1 Region algebra (and sgrep)
the model: the model: region algebraregion algebra– relatively low-level but appropriate for e.g. XMLrelatively low-level but appropriate for e.g. XML– documents seen as contiguous portions called documents seen as contiguous portions called regionsregions
an implementation: an implementation: sgrepsgrep– ““structured grep”, command-line based search toolstructured grep”, command-line based search tool
» version 0.99 (Unix/Linux): basic featuresversion 0.99 (Unix/Linux): basic features» version 1.92a (Unix/Linux/Win32): indexing, version 1.92a (Unix/Linux/Win32): indexing,
SGML/XML/HTML support, other additional featuresSGML/XML/HTML support, other additional features– extracts portions of files/input streams extracts portions of files/input streams
SDPL 2001 Notes 8:1 Region Algebra 3
Region algebra: BackgroundRegion algebra: Background
PatPatTMTM (University of Waterloo; OpenText) (University of Waterloo; OpenText)– efficient full-text index (suffix array)efficient full-text index (suffix array)– match-point delimited regionsmatch-point delimited regions
» result sets of non-overlapping regions only; result sets of non-overlapping regions only; overlapping results represented as their start points overlapping results represented as their start points semantic problems semantic problems
‘‘‘‘generalized concordance lists’’ generalized concordance lists’’ – Clarke, Cormack and Burkowski 1995Clarke, Cormack and Burkowski 1995– overlaps but no nesting in resultsoverlaps but no nesting in results
SDPL 2001 Notes 8:1 Region Algebra 4
Nested region algebraNested region algebra
Jaakkola & Kilpeläinen, U. of Helsinki 1995-1999Jaakkola & Kilpeläinen, U. of Helsinki 1995-1999 retrieval of document components as dynamically retrieval of document components as dynamically
defined regions, based on their mutual ordering defined regions, based on their mutual ordering and nestingand nesting
no restrictions on the length, overlapping or no restrictions on the length, overlapping or nesting of regionsnesting of regions
no restrictions on document formatsno restrictions on document formats– regular markup formats (a’la XML) most appropriateregular markup formats (a’la XML) most appropriate
SDPL 2001 Notes 8:1 Region Algebra 5
Basics of region algebra data modelBasics of region algebra data model
Set of Set of positionspositions U= U= {0, …, {0, …, nn-1} comprising a text -1} comprising a text of length of length nn– characters/bytes (sgrep), or wordscharacters/bytes (sgrep), or words
RegionsRegions: : contiguous (non-empty) sequences of contiguous (non-empty) sequences of positions (normally fragments of files)positions (normally fragments of files)– typically occurrences of a string,typically occurrences of a string,
or document elements or document elements– regionregion a a = { = {s, ss, s+1+1…, e…, e} denoted by (} denoted by (s, es, e) )
or (or (a.s, a.ea.s, a.e), by the ), by the startstart and and end positionend position of region of region aa
SDPL 2001 Notes 8:1 Region Algebra 6
Positions in sgrepPositions in sgrep
Sample file (under Linux):Sample file (under Linux):% cat abra.txt% cat abra.txtabraabracadcadabraabra$$
Regions normally occurrences of a string:Regions normally occurrences of a string:% sgrep -o"(%s,%e)" '% sgrep -o"(%s,%e)" '"abra""abra"' abra.txt' abra.txt(0,3)(7,10)(0,3)(7,10)– output format template output format template "(%s,%e)" "(%s,%e)" applied to each applied to each
result region ("insert start position and end position in result region ("insert start position and end position in parentheses")parentheses")
SDPL 2001 Notes 8:1 Region Algebra 7
Positions in sgrep (2)Positions in sgrep (2)
The first position:The first position:% sgrep '% sgrep 'startstart' abra.txt' abra.txtaa
The last position:The last position:% sgrep '% sgrep 'endend' abra.txt' abra.txt$$
The entire document (more on operator 'The entire document (more on operator '....'' later):later):% sgrep 'start % sgrep 'start .... end' abra.txt end' abra.txtabracadabra$abracadabra$
SDPL 2001 Notes 8:1 Region Algebra 8
Relationships of regions Relationships of regions aa and and bb
Language based on relationships between Language based on relationships between regions:regions:
1)1) a a precedesprecedes b, b b, b followsfollows a a::– a a ends beforeends before b b starts,starts, a.e < b.s a.e < b.s
2) 2) a a is is included inincluded in b b ( (b b containscontains a): a a): a bb ( (b b aa):):– b.s b.s a.sa.s, and , and a.e a.e b.eb.e– proper containmentproper containment: : a a bb, and , and b b aa
3) if neither of the above, a and b 3) if neither of the above, a and b overlapoverlap
SDPL 2001 Notes 8:1 Region Algebra 9
Region algebraRegion algebra
Region algebra is a set-valued languageRegion algebra is a set-valued language– the value of any expression is a set of regionsthe value of any expression is a set of regions– c.f. relational algebra: set of rows/tuplesc.f. relational algebra: set of rows/tuples
Operations map Operations map region setsregion sets to to region sets region sets a a compositional compositional language: operands of language: operands of expressions can be any other expressionsexpressions can be any other expressions
arbitrarily complex queries can be formed arbitrarily complex queries can be formed
SDPL 2001 Notes 8:1 Region Algebra 10
Building new regionsBuilding new regions
Nested regions (e.g. document elements)Nested regions (e.g. document elements)– ‘‘‘‘followed-by’’ operatorsfollowed-by’’ operators
Non-nested regionsNon-nested regions– ‘‘‘‘quote’’ operatorsquote’’ operators
By removing overlap with other regionsBy removing overlap with other regions– ‘‘‘‘extracting’’ operatorextracting’’ operator
SDPL 2001 Notes 8:1 Region Algebra 11
‘‘‘‘Followed-by’’Followed-by’’
The most central and characteristic operator in The most central and characteristic operator in nested region algebra and sgrepnested region algebra and sgrep
allows dynamic creation of regions from their allows dynamic creation of regions from their bounding regionsbounding regions
generalises the way how parentheses are generalises the way how parentheses are paired together, starting from inside and paired together, starting from inside and always matching the closest unmatched onesalways matching the closest unmatched ones
SDPL 2001 Notes 8:1 Region Algebra 12
Followed-by exampleFollowed-by example
Consider a parenthesised expression:Consider a parenthesised expression:% cat expr.txt% cat expr.txt(1(2(3))(4))(1(2(3))(4))
Extract parenthesised sub-expressions:Extract parenthesised sub-expressions:% sgrep -o"%r% sgrep -o"%r////" '"(" " '"(" .... ")"' expr.txt ")"' expr.txt(1(2(3))(4))(1(2(3))(4))////(2(3))(2(3))////(3)(3)////(4)(4)////
NB: by default, without theNB: by default, without the -o-o output format output format specification, sgrep outputs each position covered specification, sgrep outputs each position covered by result regions only onceby result regions only once
SDPL 2001 Notes 8:1 Region Algebra 13
Value of Value of A .. BA .. B formally formally
A A andand B B sets of regionssets of regions AA .. .. BB = {( = {(a.s, b.ea.s, b.e)} )} UU ( ( ( (AA-{-{aa}) .. (}) .. (BB-{-{bb}) ),}) ),
where where a a A, b A, b BB such that such that a.e < b.sa.e < b.s and the and the a.e .. b.sa.e .. b.s distance is minimal distance is minimal – (and also distance btw (and also distance btw a.s a.s andand b.e b.e is minimal, if there is minimal, if there
are otherwise equidistant pairs)are otherwise equidistant pairs) – empty set, if there are no such empty set, if there are no such a a A A andand b b B B
each each a a A A and eachand each b b B B produces at most produces at most one result region (one result region (a.s, b.ea.s, b.e))
SDPL 2001 Notes 8:1 Region Algebra 14
Regions of document elementsRegions of document elements
Regions delimited by start and end tags:Regions delimited by start and end tags:% sgrep '"<TITLE>" .. "</TITLE>"' sgrepman.html% sgrep '"<TITLE>" .. "</TITLE>"' sgrepman.html<TITLE> sgrep - Manual page </TITLE><TITLE> sgrep - Manual page </TITLE>
Delimiting regions can be left-out:Delimiting regions can be left-out:– using also the built-in markup scanner to recognise tags: using also the built-in markup scanner to recognise tags:
% sgrep '% sgrep 'stag("TITLE")stag("TITLE") _._. etag("TITLE")etag("TITLE")' sgrepman.html ' sgrepman.html sgrep - Manual page </TITLE> sgrep - Manual page </TITLE>% sgrep 'stag("TITLE") % sgrep 'stag("TITLE") ._._ etag("TITLE")' sgrepman.html etag("TITLE")' sgrepman.html <TITLE> sgrep - Manual page <TITLE> sgrep - Manual page% sgrep 'stag("TITLE") % sgrep 'stag("TITLE") ____ etag("TITLE")' sgrepman.html etag("TITLE")' sgrepman.html sgrep - Manual page sgrep - Manual page
(possible empty regions are excluded from (possible empty regions are excluded from A __ BA __ B) )
SDPL 2001 Notes 8:1 Region Algebra 15
‘‘‘‘Quote’’ operatorsQuote’’ operators
Documents typically contain also disjoint regions, Documents typically contain also disjoint regions, often delimited by identical start and end markersoften delimited by identical start and end markers– e.g. string constants in programming languagese.g. string constants in programming languages
% cat hello.txt% cat hello.txt"Hello," said Bob, "nice day! ""Hello," said Bob, "nice day! "% sgrep '"\"" % sgrep '"\"" quotequote "\""' hello.txt "\""' hello.txt "Hello,""nice day!" "Hello,""nice day!"
– starting or ending regions, or both, can be excluded starting or ending regions, or both, can be excluded (similarly to(similarly to _., ._ _., ._ and and ____) using) using _quote_quote, , quote_quote_ and and _quote__quote_
SDPL 2001 Notes 8:1 Region Algebra 16
Extracting overlapping regionsExtracting overlapping regions
A A extractingextracting B B = regions that result by removing = regions that result by removing from regions of from regions of AA any overlap with regions in any overlap with regions in BB
Contents of sections without subsections:Contents of sections without subsections:
sgrep '"<sect>" .. "</sect>" sgrep '"<sect>" .. "</sect>" extractingextracting ("<subsect>" .. "</subsect>")' ex.xml("<subsect>" .. "</subsect>")' ex.xml
Content without markup tags:Content without markup tags:
sgrep 'start .. end sgrep 'start .. end extractingextracting ("<" ..">")' *.html("<" ..">")' *.html
Note: No operator precedence; Parenthesised sub-Note: No operator precedence; Parenthesised sub-expressions are evaluated first, otherwise left-to-rightexpressions are evaluated first, otherwise left-to-right
SDPL 2001 Notes 8:1 Region Algebra 17
Containment conditionsContainment conditions
Allow selecting regions based on their Allow selecting regions based on their context or contentcontext or content
Operators for selecting regions that Operators for selecting regions that – appear / do not appear in a given contextappear / do not appear in a given context– contain / do not contain a region of another setcontain / do not contain a region of another set
Similar operators in other variations of Similar operators in other variations of region algebraregion algebra
SDPL 2001 Notes 8:1 Region Algebra 18
Containment operators formallyContainment operators formally
A A inin B: { B: {aaAA b b BB:: a a b b
A A not innot in B: { B: {aaAA b b BB:: a a b b
A A containingcontaining B: { B: {aaAA b b BB:: a a b b
A A not containingnot containing B: { B: {aaAA b b BB:: a abb
SDPL 2001 Notes 8:1 Region Algebra 19
Set operationsSet operations
Union, intersection and difference:Union, intersection and difference: A A oror B, A B, A equalequal B, A B, A not equalnot equal B B
Rather seldom needed (except for ‘Rather seldom needed (except for ‘oror’);’);containment conditions otherwise sufficient containment conditions otherwise sufficient for expressing Boolean retrieval for expressing Boolean retrieval (See next)(See next)
SDPL 2001 Notes 8:1 Region Algebra 20
Expressing Boolean queriesExpressing Boolean queries
HTML files (HTML files (%f%f) containing “cat” AND “dog”:) containing “cat” AND “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end
containingcontaining "cat" "cat" containingcontaining "dog"' *.html "dog"' *.html
HTML files containing “cat” or “dog”:HTML files containing “cat” or “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end
containingcontaining ("cat" ("cat" oror "dog")' *.html "dog")' *.html
HTML files containing “cat” but no “dog”:HTML files containing “cat” but no “dog”: sgrep -o"%f\n" 'start .. end sgrep -o"%f\n" 'start .. end containingcontaining "cat" "cat" not containingnot containing "dog"' *.html "dog"' *.html
SDPL 2001 Notes 8:1 Region Algebra 21
Structural retrievalStructural retrieval
Arbitrarily complex queries, e.g., ‘‘extract titles of Arbitrarily complex queries, e.g., ‘‘extract titles of pages which mention cats, but not in the title of the pages which mention cats, but not in the title of the page’’:page’’: sgrep '"<TITLE>" .. "</TITLE>" in sgrep '"<TITLE>" .. "</TITLE>" in (start .. end containing (start .. end containing ("cat" not in ("<TITLE>" .. "</TITLE>"))' ("cat" not in ("<TITLE>" .. "</TITLE>"))' *.html *.html
But notice: The model supports (and sgrep implements) But notice: The model supports (and sgrep implements) only extracting the regions that satisfy the query, in orderonly extracting the regions that satisfy the query, in order– with restricted modification of each region (usingwith restricted modification of each region (using -o) -o)
SDPL 2001 Notes 8:1 Region Algebra 22
Additional operations on region setsAdditional operations on region sets
concatconcat(A): minimal set of regions covering (A): minimal set of regions covering exactly the regions of Aexactly the regions of A– default result formatting of sgrepdefault result formatting of sgrep
innerinner(A) = A not containing A(A) = A not containing A– the innermost regions in Athe innermost regions in A
outerouter(A) = A not in A(A) = A not in A– the outermost regions in Athe outermost regions in A– e.g. to get the document root element:e.g. to get the document root element:
outerouter((elementselements))
SDPL 2001 Notes 8:1 Region Algebra 23
Sgrep: some useful optionsSgrep: some useful options
-a-a: Display result regions surrounded by the rest of : Display result regions surrounded by the rest of the file; useful with the file; useful with -o-o (below) (below)
-o-o <style><style>: Set output format string, possibly : Set output format string, possibly containing following place holders:containing following place holders:– %f%f: name of the file containing the start of region,: name of the file containing the start of region,– %s, %e%s, %e: start and end position,: start and end position,– %r%r: the content (text) of the region,: the content (text) of the region,– %n%n: the ordinal number of the region, (+ few others); : the ordinal number of the region, (+ few others);
Output once for each result region, with current values Output once for each result region, with current values substituted for place holderssubstituted for place holders
SDPL 2001 Notes 8:1 Region Algebra 24
Sgrep: more useful optionsSgrep: more useful options
-c -c display only count of matching regions display only count of matching regions -h-h a short help (list of options) a short help (list of options) -f file-f file read commands from read commands from filefile -i -i ignore case distinctions in phrases ignore case distinctions in phrases -S-S stream mode (regions extend across files) stream mode (regions extend across files) -T/-t-T/-t show statistics about things done/time show statistics about things done/time
spentspent For more, see the man page and READMEFor more, see the man page and README
SDPL 2001 Notes 8:1 Region Algebra 25
Applications of sgrepApplications of sgrep
Simple document assembly Simple document assembly – Jaakkola and Kilpeläinen: Using sgrep for querying Jaakkola and Kilpeläinen: Using sgrep for querying
structured text files, structured text files, SGML Finland’96SGML Finland’96
Analysing element structures Analysing element structures As a Web site search engine As a Web site search engine
– E.g. Chapter 7 in Leventhal, Lewis & Fuchs: E.g. Chapter 7 in Leventhal, Lewis & Fuchs: Designing Designing XML Internet ApplicationsXML Internet Applications
– Using sgrep indexer to speed up querying static filesUsing sgrep indexer to speed up querying static files
SDPL 2001 Notes 8:1 Region Algebra 26
An assembly system prototypeAn assembly system prototype
SDPL 2001 Notes 8:1 Region Algebra 27
Analysing element structuresAnalysing element structures
Structure of element instances in a collection? Structure of element instances in a collection? The DTD does not tell!The DTD does not tell!
Element statistics:Element statistics: lu 1751 (min: 67, avg: 6849.37, max: 115620)lu 1751 (min: 67, avg: 6849.37, max: 115620) 41 -> 98 huom 41 -> 98 huom 1751 -> 1751 nu 1751 -> 1751 nu 1748 -> 1748 ot 1748 -> 1748 ot
1751 1751 lulu elements, with shown minimum, average and elements, with shown minimum, average and maximum lenghts; 41 contain maximum lenghts; 41 contain huomhuom elements directly, and elements directly, and 98 98 huomhuom elements are children ofelements are children of lulu elementselements..
SDPL 2001 Notes 8:1 Region Algebra 28
Structure analysis: implementation Structure analysis: implementation
Generated by a Python script Generated by a Python script each datum (element type name, count, each datum (element type name, count,
length) computed by generating and length) computed by generating and executing an sgrep queryexecuting an sgrep query
SDPL 2001 Notes 8:1 Region Algebra 29
Implementation of sgrepImplementation of sgrep
Steps in executing an sgrep query?Steps in executing an sgrep query?1. Preprocessing 1. Preprocessing
» macro expansion; external & optionalmacro expansion; external & optional
2. Query parsing2. Query parsing
3. Query optimisation3. Query optimisation
4. String retrieval on the text4. String retrieval on the text
5. Operator evaluation5. Operator evaluation
6. Data delivery6. Data delivery» Outputting of result regionsOutputting of result regions
SDPL 2001 Notes 8:1 Region Algebra 30
1. Preprocessing 1. Preprocessing
Consider the below query Q:Consider the below query Q:sgrep -f macros -e'outer(S in S)' a.xml b.xmlsgrep -f macros -e'outer(S in S)' a.xml b.xml
With suitable macro definitions Q becomes With suitable macro definitions Q becomes outer(("<sec>" .. "</sec>") in outer(("<sec>" .. "</sec>") in
("<sec>" .. "</sec>"))("<sec>" .. "</sec>"))
m 4 preprocessormacros
expanded query
Q
SDPL 2001 Notes 8:1 Region Algebra 31
2. Query parsing2. Query parsing
– Query into an Query into an operator tree:operator tree:» operators as internal nodes, string phrases as operators as internal nodes, string phrases as
leavesleaves
outerouter
inin
"<sec>""<sec>"
....
"</sec>""</sec>" "<sec>""<sec>"
....
"</sec>""</sec>"
SDPL 2001 Notes 8:1 Region Algebra 32
3. Query optimisation 3. Query optimisation
– common subexpression eliminationcommon subexpression elimination– operator tree to an operator tree to an operator graphoperator graph (DAG) (DAG)
» e.g, an operator tree of 735 nodes (for BibTeX e.g, an operator tree of 735 nodes (for BibTeX records) reduces to a DAG of 103 nodesrecords) reduces to a DAG of 103 nodes
outerouter
inin
"<sec>""<sec>"
....
"</sec>""</sec>"
SDPL 2001 Notes 8:1 Region Algebra 33
4. String retrieval (1)4. String retrieval (1)
– A deterministic Aho-Corasick automaton A deterministic Aho-Corasick automaton M M built of string patterns in the querybuilt of string patterns in the query
M:
>s e c
>s e c<
/
SDPL 2001 Notes 8:1 Region Algebra 34
String retrieval (2)String retrieval (2)
– Input files (or streams) scanned using Input files (or streams) scanned using MM» each pattern simultaneouslyeach pattern simultaneously» region list region list of pattern occurrences attached to of pattern occurrences attached to
the leaves of the operator graphthe leaves of the operator graphouterouter
inin
"<sec>""<sec>"
....
"</sec>""</sec>"[(5,9), (26,30), …][(5,9), (26,30), …] [(20,25), (100,105), …][(20,25), (100,105), …]
Alternatively, region lists can be Alternatively, region lists can be obtained by a look-up from a obtained by a look-up from a pre-computed index (sgrep pre-computed index (sgrep 1.92), without scanning target 1.92), without scanning target filesfiles
SDPL 2001 Notes 8:1 Region Algebra 35
5. Operator evaluation5. Operator evaluation
Bottom-up traversal of the operator graph Bottom-up traversal of the operator graph – value of node: a value of node: a region listregion list
» of (start, end) index pairsof (start, end) index pairs» internal representation of region setsinternal representation of region sets» maintained in increasing (start, end) ordermaintained in increasing (start, end) order
– each node evaluated onceeach node evaluated once– (sub)queries with large results may require lots (sub)queries with large results may require lots
of main memory!of main memory!
SDPL 2001 Notes 8:1 Region Algebra 36
Operator graph evaluationOperator graph evaluation
outerouter
inin
"<sec>""<sec>"
....
"</sec>""</sec>"
[(5,9), (26,30), …][(5,9), (26,30), …] [(20,25), (100,105), …][(20,25), (100,105), …]
[(5,25), (26,105), …][(5,25), (26,105), …]
[(1000,1500), (1010,1200), …][(1000,1500), (1010,1200), …]
[(1000,1500), (1800,1923), …][(1000,1500), (1800,1923), …]
SDPL 2001 Notes 8:1 Region Algebra 37
6. Data delivery6. Data delivery
Finally, the regions in the result region list Finally, the regions in the result region list are output in document orderare output in document order– text of regions retrieved by indexing target files text of regions retrieved by indexing target files
by start and end positions of the result regionsby start and end positions of the result regions– result regions possibly modified according to result regions possibly modified according to
the output style specification (given with option the output style specification (given with option -o-o))
SDPL 2001 Notes 8:1 Region Algebra 38
Evaluating region algebra operatorsEvaluating region algebra operators
Synchronised (merge-like, linear) traversal Synchronised (merge-like, linear) traversal of operand lists, of operand lists, except that ...except that ...– A A .... B B requires sorting of requires sorting of AA (by region end (by region end
positions) if positions) if AA contains nested regions contains nested regions– A A extractingextracting B B may require considering the may require considering the
same regions of A multiple times same regions of A multiple times
SDPL 2001 Notes 8:1 Region Algebra 39
Evaluation complexityEvaluation complexity
Worst-case time Worst-case time OO((t nt n), where ), where nn maximum maximum size and size and t t maximum ''thickness'' (number of maximum ''thickness'' (number of regions overlapping at any position) of any regions overlapping at any position) of any region setregion set– Jaakkola and Kilpeläinen: Nested Text-Region Jaakkola and Kilpeläinen: Nested Text-Region
Algebra, January 1999.Algebra, January 1999. In practise linear time;In practise linear time; comparable to Unix grepscomparable to Unix greps
SDPL 2001 Notes 8:1 Region Algebra 40
Extended features in sgrep 2Extended features in sgrep 2
(pre-release v. 1.92a)(pre-release v. 1.92a) documented (only!) in the README filedocumented (only!) in the README file XML/HTML/SGML supportXML/HTML/SGML support
– scanner to recognise markup tokensscanner to recognise markup tokens » SGML/HTML-mode (default)SGML/HTML-mode (default)
markup names converted to UPPER CASE markup names converted to UPPER CASE
» XML-mode: case-sensitive retrieval of tag namesXML-mode: case-sensitive retrieval of tag names– simple-minded parser to recognise elementssimple-minded parser to recognise elements– 16-bit wide characters in XML documents16-bit wide characters in XML documents– NBNB: currently no expanding of entity references; : currently no expanding of entity references;
No validation or well-formedness checkingNo validation or well-formedness checking
SDPL 2001 Notes 8:1 Region Algebra 41
Extended sgrep features (2)Extended sgrep features (2)
Direct containment:Direct containment:– A A childreningchildrening B, A B, A parentingparenting B B
Restricting the number of result regionsRestricting the number of result regions– first(n, E), last(n, E) first(n, E), last(n, E)
Restricting the length of result regionsRestricting the length of result regions– first_bytes(n, E), last_bytes(n, E)first_bytes(n, E), last_bytes(n, E)
Nearness operatorsNearness operators– A A near(n)near(n) B, A B, A near_before(n)near_before(n) B B
Indexing of both structure and contentIndexing of both structure and content
SDPL 2001 Notes 8:1 Region Algebra 42
Restricting result regions Restricting result regions
Display first 30 bytes of first 3 paragraphs:Display first 30 bytes of first 3 paragraphs:
turni> sgrep -o"%r ...\n" -g xml \turni> sgrep -o"%r ...\n" -g xml \''first_bytesfirst_bytes(30, (30, firstfirst(3, stag("p") ..(3, stag("p") .. etag("p")))' REC-xml-19980210.xml etag("p")))' REC-xml-19980210.xml<p>The Extensible Markup Langua...<p>The Extensible Markup Langua...<p>This document has been revie...<p>This document has been revie...<p><p>This document specifies a s...This document specifies a s...
SDPL 2001 Notes 8:1 Region Algebra 43
Restricting results (2)Restricting results (2)
Get end tags of the children of the document Get end tags of the children of the document element:element:turni> sgrep -o"%r\n" -g xml \turni> sgrep -o"%r\n" -g xml \'etag("*") containing 'etag("*") containing last_byteslast_bytes(1, (1,
outer(outer(elementselements in in elementselements))' \))' \REC-xml-19980210.xmlREC-xml-19980210.xml
</header></header></body></body>
</back></back>
SDPL 2001 Notes 8:1 Region Algebra 44
Indexing in sgrep 2.0Indexing in sgrep 2.0
Both the structure and content (words) indexed Both the structure and content (words) indexed (max file size 2 GB)(max file size 2 GB)
Creates a separate index (Creates a separate index (postings filepostings file) ) – terms with region lists of their occurrencesterms with region lists of their occurrences– index a compressed binary file, size 30-60% of the index a compressed binary file, size 30-60% of the
original filesoriginal files
Increases the efficiency of accessing static Increases the efficiency of accessing static document collections considerablydocument collections considerably– see the next examplesee the next example
SDPL 2001 Notes 8:1 Region Algebra 45
An indexing exampleAn indexing example
Index a 64-fold copy of the XML Rec Index a 64-fold copy of the XML Rec ((S64S64, 10 MB), 10 MB)::> sgrep > sgrep -I-I -g xml -g xml -c S64.index-c S64.index S64 S64> ls -l S64.index> ls -l S64.index…… 3621954 Feb 8 21:01 S64.index3621954 Feb 8 21:01 S64.index
How often word "Bray" occurs in file How often word "Bray" occurs in file S64S64??> time sgrep -c 'word("Bray")' S64> time sgrep -c 'word("Bray")' S643203203.80user 0.07system 3.80user 0.07system 0:03.870:03.87elapsed elapsed
The same using the index:The same using the index:> time sgrep -c > time sgrep -c -x-x S64.indexS64.index 'word("Bray")' 'word("Bray")'3203200.02user 0.01system 0.02user 0.01system 0:00.030:00.03elapsedelapsed
A 100-fold speed-up!A 100-fold speed-up!