cs276a text information retrieval, mining, and exploitation lecture 16 3 dec 2002

CS276A: Text Information Retrieval, Mining, and Exploitation

Lecture 16, 3 Dec 2002

Today’s Topics

Recap: XQuery
Course evaluations
XML indexing and search II
  Systems supporting relevance ranking
  UC Davis system (XQuery extension)
  XIRQL: Relevance ranking / XML search
  Summary, discussion

Metadata

Recap: XQuery

Møller and Schwartzbach

Queries Supported by XQuery

Location/position (“chapter no. 3”)
Attribute/value
  /play/title contains “hamlet”
Path queries
  title contains “hamlet”
  /play//title contains “hamlet”
Complex graphs
  Ingredients occurring in two recipes
Subsumes: hyperlinks
What about relevance ranking?

XQuery 1.0 Standard on Order

Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model.

… if a node in document A is before a node in document B, then every node in document A is before every node in document B.
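As a minimal sketch of what document order means in practice, the following Python snippet lists element nodes in depth-first, left-to-right order across two documents. The sample documents and the function name `document_order` are illustrative assumptions, and only element nodes are covered (the XQuery data model also orders text and attribute nodes).

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document collection; element names are illustrative only.
DOC_A = "<play><title>Hamlet</title><act><scene>I</scene><scene>II</scene></act></play>"
DOC_B = "<play><title>Macbeth</title></play>"


def document_order(xml_docs):
    """Return (doc_index, tag) pairs in document order across the documents."""
    order = []
    for doc_index, xml_text in enumerate(xml_docs):
        root = ET.fromstring(xml_text)
        # Element.iter() walks the tree depth-first, in document order.
        for node in root.iter():
            order.append((doc_index, node.tag))
    return order


nodes = document_order([DOC_A, DOC_B])
print(nodes)
```

Every node of the first document precedes every node of the second, which is the constraint that later gets in the way of relevance ranking.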

Document Collection = 1 XML Doc.

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
  <MedlineCitation>
    <MedlineID>91060009</MedlineID>
    <DateCreated><Year>1991</Year><Month>01</Month><Day>10</Day></DateCreated>
    <Article>some content</Article>
  </MedlineCitation>

Document Collection = 1 XML Doc.

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
  <MedlineCitation> (content) </MedlineCitation>
  <MedlineCitation> (content) </MedlineCitation>
  <MedlineCitation> (content) </MedlineCitation>
  …
</MedlineCitationSet>

How XQuery makes ranking difficult

All documents in collection A must be ranked before all documents in collection B.

Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By Clause

F  for $d in document("depts.xml")//deptno
L  let $e := document("emps.xml")//emp[deptno = $d]
W  where count($e) >= 10
O  order by avg($e/salary) descending
R  return <big-dept> { $d,
       <headcount>{count($e)}</headcount>,
       <avgsal>{avg($e/salary)}</avgsal> } </big-dept>
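To make the FLWOR structure concrete, here is a rough Python analogue of the query above over in-memory data. The lists `deptnos` and `emps` are made-up stand-ins for depts.xml and emps.xml, and `big_depts` is a hypothetical helper name; the point is only that the sort key is an overt function of the data being returned.

```python
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml.
deptnos = [10, 20, 30]
emps = (
    [{"deptno": 10, "salary": 50_000} for _ in range(12)]
    + [{"deptno": 20, "salary": 70_000} for _ in range(10)]
    + [{"deptno": 30, "salary": 90_000} for _ in range(3)]
)


def big_depts(deptnos, emps, min_headcount=10):
    rows = []
    for d in deptnos:                                     # F: for
        e = [emp for emp in emps if emp["deptno"] == d]   # L: let
        if len(e) >= min_headcount:                       # W: where
            rows.append({
                "deptno": d,
                "headcount": len(e),
                "avgsal": mean(emp["salary"] for emp in e),
            })
    # O: order by avg salary, descending
    rows.sort(key=lambda row: row["avgsal"], reverse=True)
    return rows                                           # R: return


result = big_depts(deptnos, emps)
print(result)
```

Department 30 has the highest average salary but is filtered out by the headcount condition before ordering.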

XQuery Order By Clause

Order by clause only allows ordering by an “overt” criterion
Relevance ranking
  Is often proprietary
  Can’t be expressed easily as a function of the set to be ranked
  Is better abstracted out of query formulation (cf. WWW)

Next: XML IR System with Relevance Ranking

Thursday (12/5): Presentation of Projects

Copy slides into your subdirectory in the cs276a submit/ directory
Slides must be named slides.{ppt,pdf}
Slides must be submitted by noon tomorrow (12/4)
You have 3 minutes to present on Thursday
  Use all of it
  Or leave 1 minute for 1 question

Projects are due today

Evaluations

UC Davis System

Extension of XQuery for relevance ranking
Term counts are stored with only one related node (usually a leaf)
Term weight computation at query time
Solves the “granularity problem”
Reasonable index size
  30% - 60% of collection

Integrated Query Processing

Abbreviations

DF = Document Fragment
DFS = Document Fragment Sequence
SSD = sequence of sets of DFs

Element vs Attribute

Search for element /play[title=“hamlet”]

Search for attribute /play[@title=“hamlet”]

Both are valid encodings -> need to know schema

UC Davis system treats elements/attributes identically for text search

XIRQL supports /play[~title=“hamlet”]
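The element/attribute ambiguity can be illustrated with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample collection and the helper `plays_titled` are assumptions for illustration; this is a sketch of schema-agnostic matching in the spirit of the slide, not the UC Davis or XIRQL implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection mixing both encodings of a play's title.
COLLECTION = ET.fromstring(
    "<plays>"
    "<play><title>hamlet</title></play>"
    "<play title='hamlet'/>"
    "<play><title>macbeth</title></play>"
    "</plays>"
)

# Schema-aware queries: each one finds only one of the two encodings.
by_element = COLLECTION.findall("play[title='hamlet']")
by_attribute = COLLECTION.findall("play[@title='hamlet']")


def plays_titled(root, title):
    """Match a play whether the title is a child element or an attribute."""
    return [
        play for play in root.findall("play")
        if play.get("title") == title or play.findtext("title") == title
    ]


both = plays_titled(COLLECTION, "hamlet")
print(len(by_element), len(by_attribute), len(both))
```

A user who does not know which encoding the schema uses needs the third form, which is what the identical treatment of elements and attributes provides.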

Proposed XQuery Extension

RankBy ::= Expr “rankby” “(” QuerySpecList “)” [“basedon” “(” TargetSpecList “)”] [“using” MethodFunctCall]
QuerySpecList ::= Expr (“,” QuerySpecList)?
TargetSpecList ::= PathExpr (“,” TargetSpecList)?

Proposed XQuery Extension: Example

document(“news.xml”)//article//paragraph rankby (“New York”)

for $c in document(“newsmeta.xml”)//category
return <category>
  <name> {$c/name/text()} </name>
  for $a in document(“news.xml”)//article[@icd=$c/@id]
  return <title> $a/title/text() </title>
  rankby ($c/keywords)
</category>

Proposed XQuery Extension: Example

for $a in document(“bib.xml”)//article
return <paper> {$a/title} </paper>
rankby (“albert”, “einstein”) basedon (references)

Observations

Possible: ranked text != returned text
  Rank on references
  Return titles
Query can be user-supplied or a path expression
  $c/keywords example

Implementation

Term weighting
  Term weights are computed on the fly
Index construction

Term Weighting

Parameters we need for computing term weights:
  tf – term frequency in DF
  df – number of DFs containing term
  dlen – length of DF in terms
  slen – number of terms in DFS
Premise: first step of query evaluation is a p query (a path)
So term weights can be computed on the fly while evaluating the p query
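The four statistics can be gathered in one pass over the fragments that a path query returns. The sketch below assumes each DF is already tokenized into a list of terms; the weighting formula in `assumed_weights` (length-normalized tf times a log idf) is an assumption chosen for illustration, since the slides do not give the system's actual formula.

```python
import math
from collections import Counter


def term_statistics(fragments, term):
    """Compute tf (per DF), df, dlen (per DF), and slen for one term.

    `fragments` plays the role of the DFS returned by the path query.
    """
    tf = [Counter(fragment)[term] for fragment in fragments]
    dlen = [len(fragment) for fragment in fragments]
    df = sum(1 for count in tf if count > 0)
    slen = sum(dlen)
    return tf, df, dlen, slen


def assumed_weights(fragments, term):
    """Illustrative tf-idf style weight; NOT the formula from the lecture."""
    tf, df, dlen, _slen = term_statistics(fragments, term)
    if df == 0:
        return [0.0] * len(fragments)
    idf = math.log(len(fragments) / df)
    return [(t / n if n else 0.0) * idf for t, n in zip(tf, dlen)]


# Hypothetical result of a path query: three paragraph fragments.
paragraphs = [["new", "york", "times"], ["york"], ["boston", "globe"]]
tf, df, dlen, slen = term_statistics(paragraphs, "york")
weights = assumed_weights(paragraphs, "york")
print(tf, df, dlen, slen, weights)
```

Because the statistics depend on which fragments the path query selects, they cannot be precomputed once per collection the way a flat document index allows.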

Index Requirements

Inverted index
Non-replication of statistics
  DF inclusion
  DF intersection
Worst case complexity
  Quadratic
  Number of DFs in current result set x number of DFs contributing to term statistics

Document Guide

Node Encoding

Use Document Guide
Encode each node with a path identifier (PID)
PID = (DG-index, [path number, path number, …])
Example:
  /db/article[1]/text/para[3]
  Encoded as: (5,(1,3))
Encode DG-index with log(n) bits
Encode each path number with log(ki) bits
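A small sketch of the bit budget implied by this encoding, under stated assumptions: n is taken to be the number of Document Guide nodes, ki the largest path number that can occur at position i, DG-indexes are zero-based, and path numbers are one-based as in the example. The function names and the packing layout are hypothetical, not the system's actual on-disk format.

```python
def bits_needed(count):
    """Bits to distinguish `count` values, i.e. ceil(log2(count)), at least 1."""
    return max(1, (count - 1).bit_length())


def pid_bits(n_guide_nodes, max_path_numbers):
    """Total bits for one PID: log(n) for the DG-index plus log(ki) per path number."""
    return bits_needed(n_guide_nodes) + sum(bits_needed(k) for k in max_path_numbers)


def pack_pid(dg_index, path_numbers, n_guide_nodes, max_path_numbers):
    """Pack a PID into one integer (assumed layout: DG-index first, then path numbers)."""
    if not 0 <= dg_index < n_guide_nodes:
        raise ValueError("DG-index out of range")
    packed = dg_index
    for number, k in zip(path_numbers, max_path_numbers):
        if not 1 <= number <= k:
            raise ValueError("path number out of range")
        # Store the one-based path number as a zero-based field.
        packed = (packed << bits_needed(k)) | (number - 1)
    return packed


# The slide's example PID (5,(1,3)), assuming 8 guide nodes and at most 4 siblings per step.
total_bits = pid_bits(8, (4, 4))
packed = pack_pid(5, (1, 3), 8, (4, 4))
print(total_bits, bin(packed))
```

With these assumed sizes the whole PID fits in 7 bits, which suggests why the encoding keeps the index compact.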

Node Encoding

Term Counter Accumulation

Worst case: quadratic
Trick: order PIDs to reduce the set of DFs to check
  (n1,N1) < (n2,N2) iff n1<n2 or (n1=n2 and N1<N2)
To find all descendants of (n1,N1)
  Start with (n1,N1)
  Proceed to first node (n2,N2) that is not a descendant
  Stop
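The scan-until-first-non-descendant idea can be sketched with a simpler stand-in for PIDs: Dewey-style tuple keys, where a descendant's key has its ancestor's key as a prefix and Python's tuple comparison provides the sort order. This is an assumption made for illustration; the actual PID order depends on the Document Guide numbering, which the slides do not spell out.

```python
from bisect import bisect_left


def descendants(sorted_keys, ancestor):
    """Return all keys below `ancestor` by scanning one contiguous run.

    `sorted_keys` must be sorted; descendants then sit directly after the
    ancestor, so the scan can stop at the first key that is not a descendant.
    """
    start = bisect_left(sorted_keys, ancestor)
    found = []
    for key in sorted_keys[start:]:
        if key[:len(ancestor)] != ancestor:
            break  # first non-descendant: stop
        if key != ancestor:
            found.append(key)
    return found


# Hypothetical node keys in sorted order.
keys = sorted([(1,), (1, 1), (1, 1, 1), (1, 1, 2), (1, 2), (2,), (2, 1)])
below = descendants(keys, (1, 1))
print(below)
```

The cost is proportional to the size of the subtree rather than to the whole set of DFs, which is the purpose of ordering the identifiers.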

Index Parameters

Performance

Limitations

Need to start with “p query” (a path)
  Often inconvenient / not possible
  User must know schema

No element/attribute distinction for r queries

XQuery allows construction of new fragments from subfragments

XQuery allows construction of completely new documents

XIRQL

University of Dortmund
  Goal: open source XML search engine
Motivation
  “Returnable” fragments are special
    “Atomic units”
    E.g., don’t return a <bold> some text </bold> fragment
  Structured Document Retrieval Principle
  Empower users who don’t know the schema
    Enable search for any person_name no matter how schema refers to it

Atomic Units

Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units

XIRQL Indexing

Structured Document Retrieval Principle

A system should always retrieve the most specific part of a document answering a query.

Example query: xql
Document:
<chapter>
  0.3 XQL
  <section> 0.5 example </section>
  <section> 0.8 XQL 0.7 syntax </section>
</chapter>
Return section, not chapter

Augmentation weights

Ensure that Structured Document Retrieval Principle is respected.

Assume different query conditions are distinct, independent events.

P(chapter,XQL) = P(XQL|chapter) + P(sec.|chapter)*P(XQL|sec.) - P(XQL|chapter)*P(sec.|chapter)*P(XQL|sec.)

= 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636

Section ranked ahead of chapter
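The arithmetic above can be checked with a few lines of Python. The function name `augmented_weight` is a hypothetical label; the numbers (own weight 0.3, section weight 0.8, augmentation weight 0.6) are the ones from the slide, and the combination is the probability of the union of two independent events.

```python
def augmented_weight(own_weight, child_weight, augmentation):
    """Combine a node's own term weight with a weight propagated from a child.

    The child's weight is scaled down by the augmentation weight, then the two
    are combined as P(A or B) = P(A) + P(B) - P(A)*P(B) for independent events.
    """
    propagated = augmentation * child_weight
    return own_weight + propagated - own_weight * propagated


section = 0.8                                   # P(XQL|sec.)
chapter = augmented_weight(0.3, section, 0.6)   # P(chapter,XQL)
print(round(chapter, 3), section > chapter)
```

Because the augmentation weight is below 1, the chapter's combined score (0.636) stays below the section's own score (0.8), so the more specific fragment wins.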

Datatypes

Example: person_name
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path

XIRQL: Summary

Relevance ranking
Fragment/context selection
Datatypes (person_name)
Probabilistic combination of evidence

XML Summary

Why you don’t want to use a DB
  Relevance ranking
Why you can’t use a standard IR engine
  Term statistics / indexing granularity
  Issues with fragments (granularity, coherence …)

XML Summary: Schemas

Ideally:
  There is one schema
  User understands schema
In practice: rare
  Many schemas
  Schemas not known in advance
  Schemas change
  Users don’t understand schemas
Need to identify similar elements in different schemas
  Example: employee

XML Summary: UI Challenges

Help user find relevant nodes in schema
  Author, editor, contributor, “from:”/sender
What is the query language you expose to the user?
  XQuery? No.
  Forms? Parametric search?
  A textbox?
In general: design layer between XML and user

Metadata

Dublin Core Element Set
  Title (e.g., Dublin Core Element Set)
  Creator (e.g., Hinrich Schuetze)
  Subject (e.g., keywords)
  Description (e.g., an abstract)
  Publisher (e.g., Stanford University)
  Contributor (e.g., Chris Manning)
  Date (e.g., 2002.12.03)
  Type (e.g., presentation)
  Format (e.g., ppt)
  Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
  Source (e.g., http://dublincore.org/documents/dces/)
  Language (e.g., English)
  Coverage (e.g., San Francisco Bay Area)
  Rights (e.g., Copyright Stanford University)

Why metadata?

“Normalized” semantics
Enables searches otherwise not possible:
  Time
  Author
  URL / filename
Non-text content
  Images
  Audio
  Video

For Effective Metadata We Need:

Semantics
  Commonly understood terms to describe information resources
Syntax
  Standard grammar for connecting terms into meaningful “sentences”
Exchange framework
  So we can recombine and exchange metadata across applications and subjects

Weibel and Miller

Dublin Core: Goals

Metadata standard
Framework for interoperability
Facilitate development of subject-specific metadata
More general goals of DC community
  Foster development of metadata tools
    Creating, editing, managing, navigating metadata
  Semantic registry for metadata
    Search declared meanings and their relationships
  Registry for metadata vocabularies
    Maybe somebody has already done the work

RDF = Resource Description Framework

Engineering standard for metadata
W3C standard
Part of W3C’s metadata framework
Specialized for WWW
Desiderata
  Combine different metadata modules (e.g., different subject areas)
  Syndication, aggregation, threading

Metadata: Who creates them?

Authors
Editors
Automatic creation
Automatic context capture

Automatic Context Capture

Metadata Pros and Cons

CONS
  Most authors are unwilling to spend time and energy on
    learning a metadata standard
    annotating documents they author
  Authors are unable to foresee all reasons why a document may be interesting.
  Authors may be motivated to sabotage metadata (patents).
PROS
  Information retrieval often does not work.
  Words poorly approximate meaning.
  For truly valuable content, it pays to add metadata.
Synthesis
  In reality, most documents have some valuable metadata
  If metadata is available, it improves relevance and user experience
  But most interesting content will always have inconsistent and spotty metadata coverage

Advanced Research Issues

Cross-lingual IR
Topic detection & tracking
Information filtering and collaborative filtering
Data fusion
Automatic classification
Merging database and IR technologies
Digital libraries
Multimedia information retrieval

Resources

Jan-Marco Bremer’s publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer

www.w3.org/XML - XML resources at W3C

Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm

Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.

http://www.sciam.com/2001/0501issue/0501berners-lee.html

http://www.xml.com/pub/a/2001/01/24/rdf.html

Additional Resources

http://www.hyperorg.com/backissues/joho-jun26-02.html#semantic

http://www.xml.com/pub/a/2001/05/23/jena.html

http://www710.univ-lyon1.fr/%7Echampin/rdf-tutorial/

http://www-106.ibm.com/developerworks/library/w-rdf/
