web bar 2004 advanced retrieval and web mining lecture 11
TRANSCRIPT
![Page 1: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/1.jpg)
WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 11
![Page 2: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/2.jpg)
Overview Monday
XML Clustering 1
Tuesday Clustering 2 Clustering 3, Interactive Retrieval
Wednesday Classification 1 Classification 2
Thursday Classification 3 Information Extraction
Friday Bioinformatics Projects
Joker Active learning in Text Mining
![Page 3: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/3.jpg)
Today’s Topics
Quick XML intro XML indexing and search
Database approach Xquery 2 IR approaches
![Page 4: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/4.jpg)
What is XML?
eXtensible Markup Language A framework for defining markup
languages No fixed collection of markup tags Each XML language targeted for
application All XML languages share features Enables building of generic tools
![Page 5: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/5.jpg)
Basic Structure
An XML document is an ordered, labeled tree
character data at the leaf nodes contain the actual data (text strings)
element nodes are each labeled with a name (often called the element type), and a set of attributes, each consisting of a
name and a value, can have child nodes
![Page 6: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/6.jpg)
XML Example
![Page 7: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/7.jpg)
XML Example
<chapter id="cmds"> <chaptitle>FileCab</chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>
![Page 8: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/8.jpg)
Elements
Elements are denoted by markup tags <foo attr1=“value” … > thetext </foo> Element start tag: foo Attribute: attr1 The character data: thetext Matching element end tag: </foo>
![Page 9: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/9.jpg)
XML vs HTML
Relationship?
![Page 10: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/10.jpg)
XML vs HTML
HTML is a markup language for a specific purpose (display in browsers)
XML is a framework for defining markup languages
HTML can be formalized as an XML language (XHTML)
XML defines logical structure only HTML: same intention, but has evolved into
a presentation language
![Page 11: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/11.jpg)
XML: Design Goals
Separate syntax from semantics to provide a common framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the future of (semi)structured information (do some of the work now done by databases)
![Page 12: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/12.jpg)
Why Use XML?
Represent semi-structured data (data that are structured, but don’t fit relational model)
XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free
![Page 13: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/13.jpg)
Applications of XML XHTML CML – chemical markup language WML – wireless markup language ThML – theological markup language
<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3> <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>
![Page 14: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/14.jpg)
XML Schemas
Schema = syntax definition of XML language
Schema language = formal language for expressing XML schemas
Examples DTD XML Schema (W3C)
Relevance for XML information retrieval Our job is much easier if we have a (one)
schema
![Page 15: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/15.jpg)
XML Tutorial
http://www.brics.dk/~amoeller/XML/index.html
(Anders Møller and Michael Schwartzbach) Previous (and some following) slides are
based on their tutorial
![Page 16: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/16.jpg)
XML Indexing and Search
![Page 17: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/17.jpg)
Native XML Database
Uses XML document as logical unit Should support
Elements Attributes PCDATA (parsed character data) Document order
Contrast with DB modified for XML Generic IR system modified for XML
![Page 18: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/18.jpg)
XML Indexing and Search
Most native XML databases have taken a DB approach Exact match Evaluate path expressions No IR type relevance ranking
Only a few that focus on relevance ranking Many types of XML don’t need relevance
ranking If there is a lot of text data, relevance
ranking is usually needed.
![Page 19: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/19.jpg)
Timber: DB extension for XML
DB: search tuples Timber: search trees Main focus
Complex and variable structure of trees (vs. tuples)
Ordering Non-native XML database
without relevance ranking without “IR-type” handling of text
![Page 20: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/20.jpg)
Three Native XML Databases
Toxin Xirql IBM Haifa system
![Page 21: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/21.jpg)
ToXin
Exploits overall path structure Supports any general path query
Query evaluation in three stages Preselection stage Selection stage Postselection stage
![Page 22: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/22.jpg)
ToXin: Motivation
Strawman (Dataguides)
Index all paths occurring in database
Sufficient for simple queries: Find all authors with last
name Smith Does not allow backward
navigation
Example query: find all the titles of articles
authored by Smith
![Page 23: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/23.jpg)
Query Evaluation Stagesfor Backward Navigation
Pre-selection First navigation down the tree
Selection Value selection according to filter
Post-selection Navigation up and down again
![Page 24: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/24.jpg)
ToXin
![Page 25: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/25.jpg)
Evaluation:Factors Impacting Performance
Data source (collection) specific Document size Number of XML nodes and values Path complexity (degree of nesting) Average value size
Query specific Selectiveness of path constraint Size of query answer Number of elements selected by filter
![Page 26: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/26.jpg)
Test Collections
![Page 27: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/27.jpg)
Query Classification
![Page 28: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/28.jpg)
Evaluation
![Page 29: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/29.jpg)
ToXin: Summary
Efficient system supporting structured queries
All paths are indexed (not just from root) Path index linear in corpus size Shortcomings
Order of nodes ignored No IR-type relevance
![Page 30: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/30.jpg)
IR/Relevance Ranking for XML
Why is this difficult?
![Page 31: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/31.jpg)
IR XML Challenge 1: Term Statistics
There is no document unit in XML How do we compute tf and idf? Global tf/idf over all text contexts is
problematic Consider medical collection “new” not a discriminative term in general Very discriminative for journal titles
New England Journal of Medicine
![Page 32: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/32.jpg)
IR XML Challenge 2: Fragments
Which fragments are legitimate to return? Paragraph, abstract, title Bold, italic
IR systems don’t store content (only index) Need to go to document for displaying
fragment Problematic if fragment is not simply a node
![Page 33: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/33.jpg)
Remainder of Lecture
Queries for semi-structured text How they differ from regular IR queries Xquery
Two XML search systems with relevance ranking Xirql IBM Haifa system
![Page 34: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/34.jpg)
Types of (Semi)Structured Queries
Location/position (“chapter no.3”) Simple attribute/value
/play/title contains “hamlet” Path queries
title contains “hamlet” /play//title contains “hamlet”
Complex graphs Employees with two managers
All of the above: mixed structure/content
![Page 35: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/35.jpg)
XPath
Declarative language for Addressing (used in XLink/XPointer and in
XSLT) Pattern matching (used in XSLT and in
XQuery) Location path
a sequence of location steps separated by /
Example: child::section[position()<6] /
descendant::cite / attribute::href
![Page 36: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/36.jpg)
Axes in XPath
ancestor, ancestor-or-self, attribute, child, descendent, descendent-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
![Page 37: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/37.jpg)
Location steps
A single location step has the form: axis :: node-test [ predicate ]
The axis selects a set of candidate nodes (e.g. the child nodes of the context node).
The node-test performs an initial filtration of the candidates based on their types (chardata node, processing
instruction, etc.), or names (e.g. element name).
The predicates (zero or more) cause a further, more complex, filtration
child::section[position()<6]
![Page 38: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/38.jpg)
XQuery
SQL for XML Usage scenarios
Human-readable documents Data-oriented documents Mixed documents (e.g., patient records)
Based on XPath
![Page 39: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/39.jpg)
XQuery Expressions
path expressions element constructors list expressions conditional expressions quantified expressions datatype expressions
![Page 40: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/40.jpg)
FLWR Expressions
FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
at this stage, we have an ordered list of tuples of bindings: ($p,$b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
![Page 41: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/41.jpg)
XQuery vs SQL
Order matters! document("zoo.xml")//chapter[2]//
figure[caption = "Tree Frogs"] XQuery is turing complete, SQL is not.
![Page 42: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/42.jpg)
XQuery Example
Møller and Schwartzbach
![Page 43: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/43.jpg)
XQuery 1.0 Standard on Order
Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model.
… if a node in document A is before a node in document B, then every node in document A is before every node in document B.
This structure-oriented ordering can have undesirable effects.
Example: Medline
![Page 44: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/44.jpg)
Document collection = 100s of XML docs, each with thousands of abstracts
<!DOCTYPE MedlineCitationSetPUBLIC "MedlineCitationSet""http://www.nlm.nih.gov/databases/dtd/
nlmmedline_001211.dtd"><MedlineCitationSet><MedlineCitation><MedlineID>91060009</MedlineID><DateCreated><Year>1991</Year><Month>01</
Month><Day>10</Day></DateCreated>
<Article>some content</Article></MedlineCitation>
![Page 45: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/45.jpg)
Document collection = 100s of XML docs, each with thousands of abstracts
<!DOCTYPE MedlineCitationSetPUBLIC "MedlineCitationSet""http://www.nlm.nih.gov/databases/dtd/
nlmmedline_001211.dtd"><MedlineCitationSet><MedlineCitation> (content)
</MedlineCitation><MedlineCitation> (content)
</MedlineCitation>…</MedlineCitationSet>
![Page 46: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/46.jpg)
How XQuery makes ranking difficult
All documents in collection A must be ranked before all documents in collection B.
Fragments must be ordered in depth-first, left-to-right order.
![Page 47: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/47.jpg)
Semi-Structured Queries
More complex than “unstructured” queries Xquery standard
![Page 48: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/48.jpg)
XIRQL
University of Dortmund Goal: open source XML search engine
Motivation “Returnable” fragments are special
“atomic units” E.g., don’t return a <bold> some text </bold>
fragment Structured Document Retrieval Principle Empower users who don’t know the schema
Enable search for any person_name no matter how schema refers to it
![Page 49: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/49.jpg)
Atomic Units
Specified in schema Only atomic units can be returned as result
of search (unless unit specified) Tf.idf weighting is applied to atomic units Probabilistic combination of “evidence”
from atomic units
![Page 50: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/50.jpg)
XIRQL Indexing
![Page 51: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/51.jpg)
Structured Document Retrieval Principle
A system should always retrieve the most specific part of a document answering a query.
Example query: xql Document:
<chapter> 0.3 XQL<section> 0.5 example </section><section> 0.8 XQL 0.7 syntax </section></chapter>
Return section, not chapter
![Page 52: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/52.jpg)
Augmentation weights
Ensure that Structured Document Retrieval Principle is respected.
Assume different query conditions are disjoint events -> independence.
P(XQL|chapter-N) =P(XQL|chapter-F)+P(sec.|chapter-N)*P(XQL|sec.)-P(XQL|chapter-F)*P(sec.|chapter-N)*P(XQL|
sec.) = 0.3+0.6*0.8-0.3*0.6*0.8 = 0.636
P(XQL|sec.)=0.8 > 0.636=P(XQL|chapter-N) Section ranked ahead of chapter
![Page 53: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/53.jpg)
Datatypes
Example: person_name Assign all elements and attributes with
person semantics to this datatype Allow user to search for “person” without
specifying path
![Page 54: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/54.jpg)
XIRQL: Summary
Relevance ranking Fragment/context selection Datatypes (person_name) Probabilistic combination of evidence
![Page 55: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/55.jpg)
IBM Haifa Approach
Reject XQuery Willing to give up some expressiveness
No joins and backward navigation Find all the titles of articles authored by Smith
Simpler & more efficient approach Represent queries as XML fragments
![Page 56: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/56.jpg)
Query Examples
![Page 57: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/57.jpg)
Extended Weighting Formula
Direct extension of tf.idf cr = context resemblance measure
![Page 58: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/58.jpg)
Context Resemblance Measures
Flat: Perfect match cr(ci,cj):=1 if i==j, cr(ci,cj):=0 otherwise
Partial match cr(ci,cj):= (1+|ci|)/(1+|cj|) if ci subsequence
of cj cr(ci,cj):= 0 otherwise
Fuzzy match For example, string similarity of paths Example?
Ignore context cr(ci,cj) := 1 in all cases
![Page 59: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/59.jpg)
Implementation
Indexing Index term/context pairs: t#c Example: istambul#/country/capital
Retrieval Fetch all contexts of a term
![Page 60: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/60.jpg)
Weighting
IDF per context: compute inverse document frequency for each context separately Problem: not enough data
Global IDF: compute a single global idf weighting
Merge-idf Compute idf for context ci by looking at all
contexts with similarity > 0 Merge-all
Compute tf in analogy to merge-idf
![Page 61: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/61.jpg)
Results
![Page 62: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/62.jpg)
Discussion
Flat best But it depends on the query. Average hides
individual differences. Which queries will flat do well on?
Semantics of XML structure Best case
XML structure corresponds to unit/subunit structure of documents
Worst case (except for flat) Semantics of terms different in different structural
units
![Page 63: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/63.jpg)
IBM Haifa: Summary
Goal: information discovery vs. data exchange, data access via API etc
Queries are XML fragments No separate query language
One of the best performers in Inex bakeoff Extension of vector space Works well for:
Specific context, vague information need Doesn’t work well for
Non-specific context, DB-type information need
![Page 64: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/64.jpg)
XML Summary
DB approach Good for DB-type queries But no relevance ranking And no ordering
Why you can’t use a standard IR engine Term statistics / indexing granularity Issues with fragments (granularity,
coherence …) Different approaches to relevance-ranked
XML IR
![Page 65: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/65.jpg)
XML IR challenge: Schemas
Ideally: There is one schema User understands schema
In practice: rare Many schemas Schemas not known in advance Schemas change Users don’t understand schemas
Need to identify similar elements in different schemas Example: employee
![Page 66: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/66.jpg)
XML IR challenge: UI
Help user find relevant nodes in schema Author, editor, contributor, “from:”/sender
What is the query language you expose to user? XQuery? No. Forms? Parametric search? A textbox?
In general: design layer between XML and user
![Page 67: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/67.jpg)
Project Suggestions
XML information retrieval using Lucene Address some of the XML IR challenges
Automatic creation of datatypes Weighting Others?
![Page 68: WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11](https://reader036.vdocument.in/reader036/viewer/2022081520/56649f065503460f94c1c8ba/html5/thumbnails/68.jpg)
Resources
xquery full text requirements
Other approaches http://www.cs.cornell.edu/database/publicati
ons/2003/sigmod2003-xrank.pdf http://www.ercim.org/publication/ws-procee
dings/DelNoe01/22_Schlieder.pdf
Xml classification http://citeseer.nj.nec.com/583672.html
http://www.w3.org/TR/xquery-full-text-requirements/