xml retrieval with slides of c. manning und h.schutze
DESCRIPTION
XML Retrieval with slides of C. Manning und H.Schutze. Outline. What is XML? Challenges in XML retrieval Vector space model for XML retrieval Evaluation of text-centric XML retrieval. What is XML?. eXtensible Markup Language A framework for defining markup languages - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/1.jpg)
XML Retrieval
with slides of C. Manning und H.Schutze
04/12/2008
![Page 2: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/2.jpg)
Outline
04/12/2008 2
• What is XML?• Challenges in XML retrieval• Vector space model for XML retrieval• Evaluation of text-centric XML retrieval
![Page 3: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/3.jpg)
What is XML?
• eXtensible Markup Language• A framework for defining markup languages• No fixed collection of markup tags• Each XML language targeted for application• All XML languages share features• Enables building of generic tools
04/12/2008 3
![Page 4: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/4.jpg)
XML Example
<chapter id="cmds"> <chaptitle>FileCab</chaptitle> <para>This chapter describes the commands that manage
the <tm>FileCab</tm>inet application.
</para> </chapter>
404/12/2008
![Page 5: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/5.jpg)
Basic Structure• An XML document is an ordered, labeled tree • character data leaf nodes contain the actual data
(text strings) • element nodes, are each labeled with
a name (often called the element type), and a set of attributes, each consisting of a name and a value,
can have child nodes
504/12/2008
![Page 6: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/6.jpg)
Elements• Elements are denoted by markup tags• <foo attr1=“value” … > thetext </foo>• Element start tag: foo• Attribute: attr1• The character data: thetext• Matching element end tag: </foo>
604/12/2008
![Page 7: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/7.jpg)
Why Use XML?• Represent semi-structured data
data that are structured, but don’t fit relational model
• XML is more flexible than DBs• XML is more structured than simple IR• You get a massive infrastructure for free
704/12/2008
![Page 8: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/8.jpg)
XML Schemas• Schema = syntax definition of XML language• Schema language = formal language for expressing
XML schemas• Examples
Document Type Definition XML Schema (W3C)
• Relevance for XML IR Our job is much easier if we have a (one) schema
804/12/2008
![Page 9: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/9.jpg)
Challenges in XML Retrieval
04/12/2008
![Page 10: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/10.jpg)
Data vs. Text-centric XML• Data-centric XML: used for messaging between
enterprise applications Mainly a recasting of relational data Numerical and non-text data dominate Mostly stored in the databases
• Text-centric XML: used for annotating content Rich in text Demands good integration of text retrieval functionality
Queries are user information needs. E.g., give me the Section (element) of the document that tells me how to change a brake light
1004/12/2008
![Page 11: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/11.jpg)
Text search in RDB
Highly structured text search problems are most efficiently handled by a relational database
Information need: find all employees who are involved with invoicing
SQL query: select lastname from employees where job_desc like 'invoic%';
satisfies the information need with high precision and recall.
04/12/2008 11
id lastname job_desc salary ……invoicing…
![Page 12: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/12.jpg)
Structured Retrieval: Data
Many structured data sources containing text are best modeled as structured documents rather than relational data.
Search over structured documents: structured retrieval.
Queries in structured retrieval can be either structured or unstructured (here we assume that the collection consists only of structured documents).
Applications of structured retrieval include: digital libraries, patent databases, text with tagged entities (e.g. persons and locations), application output as marked up text.
04/12/2008 12
![Page 13: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/13.jpg)
Structured Retrieval: Query
Queries combine textual criteria with structural criteria. Information need: Articles about sightseeing tours of the
Vatican and the Coliseum.Usability of structured queries • knowledge of the data structure
sightseeing AND (COUNTRY:Vatican OR LANDMARK:Coliseum)sightseeing AND (STATE:Vatican OR BUILDING:Coliseum)
• syntax of the query language (XQuery)
• low recall in Boolean model
04/12/2008 13
for $a in doc()//au
let
![Page 14: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/14.jpg)
• Indexing unit: there is no document unit in XML Book? Chapter?
• How do we compute tf and idf?• Global tf/idf over all text context is useless• Indexing granularity
Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.
IR XML Challenges: Indexing Units & Term Statistics
1404/12/2008
![Page 15: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/15.jpg)
Partitioning an XML document into non-overlapping indexing units
1504/12/2008
![Page 16: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/16.jpg)
IR XML Challenges: Schemas• Schema heterogeneity
Many schemas Schemas not known in advance Schemas change Users don’t understand schemas
• Need to identify similar elements in different schemas Example: name, surname, family name
1604/12/2008
![Page 17: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/17.jpg)
• Help user find relevant nodes in schema Author, editor, contributor, “from:”/sender
• What is the query language you expose to the user? Specific XML query language? No. Forms? Parametric search? A textbox?
• In general: design layer between XML and user
IR XML Challenges: UI
17
![Page 18: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/18.jpg)
Vector Spaces and XML
![Page 19: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/19.jpg)
Vector spaces and XML• Vector spaces – tried+tested framework for
keyword retrieval Other “bag of words” applications in text: classification, clustering …
• For text-centric XML retrieval, can we make use of vector space ideas?
• Challenge: capture the structure of an XML document in the vector space.
19
![Page 20: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/20.jpg)
Vector spaces and XML• For instance, distinguish between the following
two cases
Book
Title Author
Bill GatesMicrosoft
Book
Title Author
Bill WulfThe PearlyGates
20
![Page 21: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/21.jpg)
Book
Title Author
BillMicrosoft
Book
Title Author
WulfPearlyGatesGatesThe
Bill
Lexicon terms.
Content-rich XML: representation
21
![Page 22: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/22.jpg)
Encoding the Gates differently• What are the axes of the vector space?• In text retrieval, there would be a single axis
for Gates• Here we must separate out the two occurrences,
under Author and Title• Thus, axes must represent not only terms, but something about
their position in an XML tree
22
![Page 23: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/23.jpg)
Queries• Before addressing this, let us consider the kinds
of queries we want to handle
Book
Title
Microsoft
Book
Title Author
Gates Bill
23
![Page 24: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/24.jpg)
Subtrees and structure• Consider all subtrees of the document that
include at least one lexicon term:Book
Title Author
BillMicrosoft Gates
BillMicrosoft Gates
Title
Microsoft
Author
Bill
Author
Gates
Book
Title
Microsoft Bill
Book
Author
Gates
e.g.
…
24
![Page 25: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/25.jpg)
Structural terms• Call each of the resulting (8+, in the previous
slide) subtrees a structural term• Note that structural terms might occur multiple
times in a document• Create one axis in the vector space for each
distinct structural term• Weights based on frequencies for number of
occurrences (just as we had tf)• All the usual issues with terms (stemming? Case
folding?) remain
25
![Page 26: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/26.jpg)
• Here the structural terms containing to or be would have more weight than those that don’t
Play
Act
To be or not to be
Play
Act
be
Play
Act
or
Play
Act
not
Play
Act
to
Exercise: How many axes are there in this example?
Example of tf weighting
26
![Page 27: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/27.jpg)
Structural terms: docs+queries• The notion of structural terms is independent of
any schema/DTD for the XML documents• Well-suited to a heterogeneous collection of XML
documents• Each document becomes a vector in the space of
structural terms• A query tree can likewise be factored into
structural terms And represented as a vector Allows weighting portions of the query
27
![Page 28: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/28.jpg)
The catch remains• This is all very promising, but …• How big is this vector space?• Can be exponentially large in the size of the
document• Cannot hope to build such an index• And in any case, still fails to answer queries
likeBook
Gates
(somewhere underneath)
28
![Page 29: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/29.jpg)
Descendants
Gates
Author Book
Author
Bill Gates
vs.Book
Author
Bill Gates
FirstName LastName
No known DTD.Query seeks Gates under Author.
29
![Page 30: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/30.jpg)
Handling descendants in the vector space• Devise a match function that yields a score in [0,1] between structural terms
• E.g., when the structural terms are paths, measure overlap
• The greater the overlap, the higher the match score Can adjust match for where the overlap occurs
Book
Author
Bill
Book
Author
Bill
LastName
Book
Bill
vs. in
30
![Page 31: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/31.jpg)
How do we use this in retrieval?• First enumerate structural terms in the query• Measure each for match against the dictionary of
structural terms Just like a postings lookup, except not Boolean (does the term exist)
Instead, produce a score that says “80% close to this structural term”, etc.
• Then, retrieve docs with that structural term, compute cosine similarities, etc.
31
![Page 32: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/32.jpg)
Example of a retrieval step
ST1 Doc1 (0.7) Doc4 (0.3) Doc9 (0.2)
ST = Structural Term
ST5 Doc3 (1.0) Doc6 (0.8) Doc9 (0.6)
IndexQuery ST
Match=0.63
Now rank the Doc’s by cosine similarity;e.g., Doc9 scores 0.578.
32
![Page 33: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/33.jpg)
Closing technicalities• But what exactly is a Doc?• In a sense, an entire corpus can be viewed as an
XML documentCorpus
Doc1 Doc2 Doc3 Doc4
33
![Page 34: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/34.jpg)
What are the Doc’s in the index?• Anything we are prepared to return as an answer• Could be nodes, some of their children …
34
![Page 35: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/35.jpg)
What are queries we can’t handle using vector spaces?• Find figures that describe the Corba architecture
and the paragraphs that refer to those figures Requires JOIN between 2 tables
• Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro Depends on order of sibling nodes.
Requires <articles> to appear after a specific <sec> node.
Query tree cannot express relations that depend on node ordering.
35
<journal><title>…</title><sec1><title>…</title></sec1><article>…</article><article>…</article></journal>
![Page 36: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/36.jpg)
Can we do IDF?• Yes, but doesn’t make sense to do it corpus-wide• Can do it, for instance, within all text under a
certain element name say Chapter• Yields a tf-idf weight for each lexicon term
under an element• Issues: how do we propagate contributions to
higher level nodes.
36
![Page 37: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/37.jpg)
Example• Say Gates has high IDF under the Author element
• How should it be tf-idf weighted for the Book element?
• Should we use the idf for Gates in Author or that in Book?
Book
Author
Bill Gates
37
![Page 38: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/38.jpg)
INEX: a benchmark for text-centric XML retrieval
![Page 39: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/39.jpg)
INEX• Benchmark for the evaluation of XML retrieval
Analog of TREC (recall IIR8)• Consists of:
Set of XML documents Collection of retrieval tasks
39
![Page 40: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/40.jpg)
INEX• Each engine indexes docs • Engine team converts retrieval tasks into queries
In XML query language understood by engine• In response, the engine retrieves not docs, but
elements within docs Engine ranks retrieved elements
![Page 41: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/41.jpg)
INEX assessment• For each query, each retrieved element is human-assessed on two
measures: Relevance – how relevant is the retrieved element Coverage – is the retrieved element too specific, too general, or
just right E.g., if the query seeks a definition of the Fast Fourier
Transform, do I get the equation (too specific), the chapter containing the definition (too general) or the definition itself
• These assessments are turned into composite precision/recall measures
41
![Page 42: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/42.jpg)
INEX corpus (Ad Hoc Track)• Articles from IEEE Computer Society publications• Wikipedia collection 2009• others
42
![Page 43: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/43.jpg)
INEX topics• Each topic is an information need, one of two
kinds: Content Only (CO) – free text queries Content and Structure (CAS) – explicit structural constraints, e.g., containment conditions.
43
![Page 44: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/44.jpg)
Sample INEX CO topic
<Title> computational biology </Title> <Keywords> computational biology, bioinformatics,
genome, genomics, proteomics, sequencing, protein folding </Keywords>
<Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology</Description>
<Narrative> To be relevant, a document/component must either talk in general terms about the opportunities at the intersection of computer science and biology, or describe a particular problem and the ways it is being attacked. </Narrative>
44
![Page 45: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/45.jpg)
INEX assessment• Each engine formulates the topic as a query
E.g., use the keywords listed in the topic.• Engine retrieves one or more elements and ranks
them.• Human evaluators assign to each retrieved element
relevance and coverage scores.
45
![Page 46: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/46.jpg)
Assessments• Relevance assessed on a scale from Irrelevant
(scoring 0) to Highly Relevant (scoring 3)• Coverage assessed on a scale with four levels:
No Coverage (N: the query topic does not match anything in the element
Too Large (L: The topic is only a minor theme of the element retrieved)
Too Small (S: the element is too small to provide the information required)
Exact Coverage(E).• So every element returned by each engine has
ratings from {0,1,2,3} × {N,S,L,E}
46
![Page 47: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/47.jpg)
Combining the relevance/coverage assessments• Define scores:
47
![Page 48: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/48.jpg)
The Q-values• Scalar measure of goodness of a retrieved elements• Can compute Q-values for varying numbers of
retrieved elements 10, 20 … etc. Means for comparing engines.
48
![Page 49: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/49.jpg)
From Q-values to … ?• INEX provides a method for turning these into
precision-recall curves• “Standard” issue: only elements returned by some
participant engine are assessed• Lots more commentary (and proceedings from
previous INEX bakeoffs): http://www.inex.otago.ac.nz
49
![Page 50: XML Retrieval with slides of C. Manning und H.Schutze](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c8c550346895dca9c78/html5/thumbnails/50.jpg)
Resources• Chapter 10 of IIR• Resources at http://ifnlp.org/ir• INEX http://www.inex.otago.ac.nz/
50