CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002

Page 1

CS276A
Text Information Retrieval, Mining, and Exploitation

Lecture 16
3 Dec 2002

Page 2

Today’s Topics

Recap: XQuery
Course evaluations
XML indexing and search II
  Systems supporting relevance ranking
  UC Davis system (XQuery extension)
  XIRQL: Relevance ranking / XML search
  Summary, discussion
Metadata

Page 3

Recap: XQuery

Møller and Schwartzbach

Page 4

Queries Supported by XQuery

Location/position (“chapter no.3”)
Attribute/value
  /play/title contains “hamlet”
Path queries
  title contains “hamlet”
  /play//title contains “hamlet”
Complex graphs
  Ingredients occurring in two recipes
Subsumes: hyperlinks
What about relevance ranking?
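
The difference between the two path queries above can be sketched with Python's standard `xml.etree.ElementTree`, which supports a small XPath subset. The sample play and the helper `contains` are made up for illustration; this is not XQuery itself, only an approximation of "/play/title" versus "/play//title".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not from the lecture.
PLAY = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><title>Act I</title><scene><title>Hamlet sees the ghost</title></scene></act>"
    "</play>"
)

def contains(elements, word):
    """Return the text of every element whose text contains `word` (case-insensitive)."""
    return [e.text for e in elements if word.lower() in (e.text or "").lower()]

# /play/title: only a title that is a direct child of the play element
direct = contains(PLAY.findall("./title"), "hamlet")
# /play//title: any title at any depth below the play element
anywhere = contains(PLAY.findall(".//title"), "hamlet")

print(direct)
print(anywhere)
```

The direct-child query sees only the play's own title, while the descendant query also picks up the scene title.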

Page 5

XQuery 1.0 Standard on Order

Document order defines a total ordering among all the nodes seen by the language processor. Informally, document order corresponds to a depth-first, left-to-right traversal of the nodes in the Data Model.

… if a node in document A is before a node in document B, then every node in document A is before every node in document B.
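
As a rough illustration of document order, the sketch below walks a small tree depth-first, left to right. The sample document and the function name `document_order` are assumptions for illustration; `Element.iter()` from the standard library visits elements in the same order.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not from the lecture.
SAMPLE = "<play><title>Hamlet</title><act><scene>I</scene><scene>II</scene></act></play>"

def document_order(root):
    """Return element tags in depth-first, left-to-right (pre-order) order."""
    order = [root.tag]
    for child in root:
        order.extend(document_order(child))
    return order

root = ET.fromstring(SAMPLE)
tags = document_order(root)

# Element.iter() walks the tree in the same order.
assert tags == [e.tag for e in root.iter()]
print(tags)
```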

Page 6

Document Collection = 1 XML Doc.

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation>
<MedlineID>91060009</MedlineID>
<DateCreated><Year>1991</Year><Month>01</Month><Day>10</Day></DateCreated>
<Article>some content</Article>
</MedlineCitation>

Page 7

Document Collection = 1 XML Doc.

<!DOCTYPE MedlineCitationSet PUBLIC "MedlineCitationSet" "http://www.nlm.nih.gov/databases/dtd/nlmmedline_001211.dtd">
<MedlineCitationSet>
<MedlineCitation> (content) </MedlineCitation>
<MedlineCitation> (content) </MedlineCitation>
<MedlineCitation> (content) </MedlineCitation>
…
</MedlineCitationSet>

Page 8

How XQuery makes ranking difficult

All documents in collection A must be ranked before all documents in collection B.

Fragments must be ordered in depth-first, left-to-right order.

Page 9

XQuery: Order By Clause

F  for $d in document("depts.xml")//deptno
L  let $e := document("emps.xml")//emp[deptno = $d]
W  where count($e) >= 10
O  order by avg($e/salary) descending
R  return <big-dept> { $d,
       <headcount>{count($e)}</headcount>,
       <avgsal>{avg($e/salary)}</avgsal> }
   </big-dept>
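
For readers less used to FLWOR syntax, the same query logic can be mirrored in Python over in-memory records. The data and the name `big_depts` are hypothetical stand-ins for depts.xml and emps.xml; the sketch only shows that the ordering criterion (average salary) is an explicit function of the data being returned.

```python
from statistics import mean

# Hypothetical stand-ins for depts.xml and emps.xml.
depts = ["D1", "D2", "D3"]
emps = (
    [{"deptno": "D1", "salary": 50_000 + 1_000 * i} for i in range(12)]
    + [{"deptno": "D2", "salary": 90_000} for _ in range(10)]
    + [{"deptno": "D3", "salary": 200_000} for _ in range(3)]
)

def big_depts(depts, emps, min_headcount=10):
    rows = []
    for d in depts:                                    # for
        e = [x for x in emps if x["deptno"] == d]      # let
        if len(e) >= min_headcount:                    # where
            rows.append({                              # return
                "deptno": d,
                "headcount": len(e),
                "avgsal": mean(x["salary"] for x in e),
            })
    # order by avg salary, descending
    rows.sort(key=lambda r: r["avgsal"], reverse=True)
    return rows

result = big_depts(depts, emps)
for row in result:
    print(row)
```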

Page 10

XQuery Order By Clause

Order by clause only allows ordering by an “overt” criterion
Relevance ranking
  Is often proprietary
  Can’t be expressed easily as a function of the set to be ranked
  Is better abstracted out of query formulation (cf. WWW)

Page 11

Next: XML IR System with Relevance Ranking

Page 12

Thursday (12/5): Presentation of Projects

Copy slides into your subdirectory in the cs276a submit/ directory
Slides must be named slides.{ppt,pdf}
Slides must be submitted by noon tomorrow (12/4)
You have 3 minutes to present on Thursday
  Use all of it
  Or leave 1 minute for 1 question

Projects are due today

Page 13

Evaluations

Page 14

UC Davis System

Extension of XQuery for relevance ranking
Term counts are stored with only one related node (usually leaf)
Term weight computation at query time
Solves “granularity problem”
Reasonable index size
  30% - 60% of collection

Page 15

Integrated Query Processing

Page 16

Abbreviations

DF = Document Fragment
DFS = Document Fragment Sequence
SSD = sequence of sets of DFs

Page 17

Integrated Query Processing

Page 18

Element vs Attribute

Search for element /play[title=“hamlet”]

Search for attribute /play[@title=“hamlet”]

Both are valid encodings -> need to know schema

UC Davis system treats elements/attributes identically for text search

XIRQL supports /play[~title=“hamlet”]
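
To make the element/attribute point concrete, here is a minimal sketch of a matcher that accepts either encoding. The two sample documents and the helper `matches` are assumptions for illustration; this is not how the UC Davis system or XIRQL's `~` operator is implemented.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same fact.
AS_ELEMENT = ET.fromstring("<play><title>hamlet</title></play>")
AS_ATTRIBUTE = ET.fromstring('<play title="hamlet"/>')

def matches(node, name, value):
    """True if `node` carries name = value as an attribute or as a child element."""
    if node.get(name) == value:                       # attribute encoding
        return True
    return any(child.text == value for child in node.findall(name))  # element encoding

for doc in (AS_ELEMENT, AS_ATTRIBUTE):
    print(matches(doc, "title", "hamlet"))
```

A query written against only one encoding would miss the other document, which is why the user otherwise needs to know the schema.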

Page 19

Proposed XQuery Extension

RankBy ::= Expr "rankby" "(" QuerySpecList ")"
           ["basedon" "(" TargetSpecList ")"]
           ["using" MethodFunctCall]

QuerySpecList ::= Expr ("," QuerySpecList)?
TargetSpecList ::= PathExpr ("," TargetSpecList)?

Page 20

Proposed XQuery Extension: Example

document("news.xml")//article//paragraph rankby ("New York")

for $c in document("newsmeta.xml")//category
return <category>
  <name> {$c/name/text()} </name>
  for $a in document("news.xml")//article[@icd=$c/@id]
  return <title> $a/title/text() </title>
  rankby ($c/keywords)
</category>

Page 21

Proposed XQuery Extension: Example

for $a in document("bib.xml")//article
return <paper> {$a/title} </paper>
rankby ("albert","einstein") basedon (references)

Page 22

Observations

Possible: Ranked text != Returned text
  Rank on references
  Return titles
Query can be user supplied or path expression
  $c/keywords example
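
The "ranked text != returned text" observation can be sketched in a few lines of Python. The records, the function `rank_by`, and the plain term-count score are all assumptions for illustration; the actual system uses its own term weighting.

```python
def rank_by(items, query_terms, based_on, returning):
    """Score each item on one field and return a different field, best first."""
    def score(item):
        tokens = item[based_on].lower().split()
        return sum(tokens.count(term.lower()) for term in query_terms)
    ranked = sorted(items, key=score, reverse=True)
    return [item[returning] for item in ranked]

# Hypothetical bibliography records.
articles = [
    {"title": "Paper A", "references": "Newton Principia"},
    {"title": "Paper B", "references": "Albert Einstein 1905 ; Einstein 1916"},
    {"title": "Paper C", "references": "Einstein Podolsky Rosen"},
]

# Rank on the references, return the titles.
titles = rank_by(articles, ["albert", "einstein"], based_on="references", returning="title")
print(titles)
```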

Page 23

Implementation

Term weighting
  Term weights are computed on the fly
Index construction

Page 24

Term Weighting

Parameters we need for computing term weights:
  tf – term frequency in DF
  df – number of DFs containing term
  dlen – length of DF in terms
  slen – number of terms in DFS
Premise: First step of query evaluation is the p query (the path query)
  So term weights can be computed on the fly while evaluating the p query
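
A minimal sketch of computing these statistics on the fly, once the path query has produced a DFS. The slides do not give the weighting formula, so the length-normalized tf.idf used here is an assumption, as are the function names and the sample fragments.

```python
import math
from collections import Counter

def term_stats(dfs_tokens, term):
    """Collect tf (per DF), df, dlen (per DF) and slen in one pass over the DFS."""
    tf = [Counter(tokens)[term] for tokens in dfs_tokens]
    dlen = [len(tokens) for tokens in dfs_tokens]
    df = sum(1 for count in tf if count > 0)
    slen = sum(dlen)
    return tf, df, dlen, slen

def weights(dfs_tokens, term):
    """Assumed formula: tf normalized by relative DF length, times a smoothed idf."""
    tf, df, dlen, slen = term_stats(dfs_tokens, term)
    n = len(dfs_tokens)
    if df == 0:
        return [0.0] * n
    avg_len = slen / n
    idf = math.log(1 + n / df)
    return [(t / (dl / avg_len)) * idf if dl else 0.0 for t, dl in zip(tf, dlen)]

# Hypothetical DFS: each DF is already tokenized.
fragments = [["new", "york", "times"], ["york", "york", "city", "hall"], ["boston"]]
w = weights(fragments, "york")
print(w)
```

Because the statistics depend on the DFS selected by the path query, they cannot be precomputed once per collection, which is the reason the weights are computed at query time.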

Page 25

Integrated Query Processing

Page 26

Index Requirements

Inverted index
Non-replication of statistics
DF inclusion
DF intersection
Worst case complexity
  Quadratic
  Number of DFs in current result set x number of DFs contributing to term statistics

Page 27

Document Guide

Page 28

Node Encoding

Use Document Guide
Encode each node with path identifier (PID)
PID = (DG-index,[path number,path number,…])
Example:
  /db/article[1]/text/para[3]
  Encoded as: (5,(1,3))
Encode DG-index with log(n) bits
Encode each path number with log(ki) bits
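
The bit budget above can be sketched as follows. The slide does not define n and ki, so the sketch assumes n is the number of Document Guide nodes and ki the largest value the i-th path number can take; the concrete values (16 DG nodes, limits 4 and 8) and the function names are made up for illustration.

```python
from math import ceil, log2

def bits_for(count):
    """Bits needed to distinguish `count` values (at least 1 bit)."""
    return max(1, ceil(log2(count)))

def encode_pid(dg_index, path_numbers, n, ks):
    """Pack a PID into an integer and report its width in bits.

    n  -- number of Document Guide nodes (assumed meaning of n)
    ks -- ks[i] is the largest value of the i-th path number (assumed meaning of ki)
    """
    code, width = dg_index, bits_for(n)      # DG-index uses log(n) bits
    for number, k in zip(path_numbers, ks):
        b = bits_for(k)                      # each path number uses log(ki) bits
        code = (code << b) | (number - 1)    # path numbers are 1-based
        width += b
    return code, width

# The slide's example PID (5,(1,3)) under the assumed limits.
code, width = encode_pid(5, [1, 3], n=16, ks=[4, 8])
print(code, width)
```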

Page 29

Node Encoding

Page 30

Term Counter Accumulation

Worst case: quadratic
Trick: order PIDs to reduce set of DFs to check
  (n1,N1) < (n2,N2) iff n1<n2 or (n1=n2 and N1<N2)
To find all descendants of (n1,N1)
  Start with (n1,N1)
  Proceed to first node (n2,N2) that is not a descendant
  Stop
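
The scan-until-first-non-descendant idea can be sketched with a simplified identifier. Note the assumption: instead of the (DG-index, path numbers) pair used on the slide, this sketch uses plain tuples of path numbers (a Dewey-style label) sorted lexicographically, so that every subtree is a contiguous run. The function names and sample PIDs are hypothetical.

```python
from bisect import bisect_left

def is_descendant_or_self(pid, ancestor):
    """With Dewey-style labels, a descendant has the ancestor's label as a prefix."""
    return pid[:len(ancestor)] == ancestor

def descendants(sorted_pids, ancestor):
    """Start at `ancestor`, scan forward, stop at the first PID outside its subtree."""
    start = bisect_left(sorted_pids, ancestor)
    found = []
    for pid in sorted_pids[start:]:
        if not is_descendant_or_self(pid, ancestor):
            break                      # everything after this is outside the subtree
        found.append(pid)
    return found

# Simplified PIDs (assumption: not the slide's exact encoding).
pids = sorted([(1,), (1, 1), (1, 1, 2), (1, 2), (2,), (2, 1)])
subtree = descendants(pids, (1, 1))
print(subtree)
```

The ordering is what avoids the quadratic case: only the DFs inside one contiguous run are checked, rather than every DF against every other.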

Page 31

Index Parameters

Page 32

Performance

Page 33

Limitations

Need to start with “p query” (a path)
  Often inconvenient / not possible
  User must know schema

No element/attribute distinction for r queries

XQuery allows construction of new fragments from subfragments

XQuery allows construction of completely new documents

Page 34

XIRQL

University of Dortmund
  Goal: open source XML search engine
Motivation
  “Returnable” fragments are special
    “atomic units”
    E.g., don’t return a <bold> some text </bold> fragment
  Structured Document Retrieval Principle
  Empower users who don’t know the schema
    Enable search for any person_name no matter how schema refers to it

Page 35

Atomic Units

Specified in schema
Only atomic units can be returned as result of search (unless unit specified)
Tf.idf weighting is applied to atomic units
Probabilistic combination of “evidence” from atomic units

Page 36

XIRQL Indexing

Page 37

Structured Document Retrieval Principle

A system should always retrieve the most specific part of a document answering a query.

Example query: xql
Document:

<chapter> 0.3 XQL
  <section> 0.5 example </section>
  <section> 0.8 XQL 0.7 syntax </section>
</chapter>

Return section, not chapter

Page 38

Augmentation weights

Ensure that Structured Document Retrieval Principle is respected.

Assume different query conditions are independent events.

P(chapter,XQL) = P(XQL|chapter) + P(sec.|chapter)*P(XQL|sec.) - P(XQL|chapter)*P(sec.|chapter)*P(XQL|sec.)

= 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636 (with augmentation weight P(sec.|chapter) = 0.6)

Section (0.8) ranked ahead of chapter (0.636)
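
The arithmetic above is the usual "at least one of several independent events" combination, and can be checked with a short sketch. The function names are hypothetical; the numbers are the ones from the example (chapter's own weight 0.3, augmentation weight 0.6, section weight 0.8).

```python
def combine_independent(probabilities):
    """Probability that at least one of several independent events occurs."""
    miss = 1.0
    for p in probabilities:
        miss *= 1.0 - p
    return 1.0 - miss

def chapter_weight(own, augmentation, section):
    # Evidence propagated up from the section is down-weighted by the
    # augmentation weight before being combined with the chapter's own evidence.
    return combine_independent([own, augmentation * section])

section = 0.8
chapter = chapter_weight(own=0.3, augmentation=0.6, section=section)
print(round(chapter, 3))   # 0.636
print(section > chapter)   # True
```

With an augmentation weight of 1.0 the chapter would score 0.86 and outrank its own section, so an augmentation weight below 1 is what keeps the more specific element on top in this example.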

Page 39

Datatypes

Example: person_name
Assign all elements and attributes with person semantics to this datatype
Allow user to search for “person” without specifying path
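
A minimal sketch of the datatype idea: element and attribute names are mapped to a datatype, and the search runs over the datatype instead of a path. The mapping `DATATYPE_OF`, the sample document, and the function `search_datatype` are assumptions for illustration, not XIRQL's actual mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element/attribute names to a datatype.
DATATYPE_OF = {"author": "person_name", "editor": "person_name", "sender": "person_name"}

# Hypothetical document mixing an attribute and an element with person semantics.
DOC = ET.fromstring(
    '<mail sender="Ada Lovelace"><body>Notes by <editor>Charles Babbage</editor>'
    "</body><subject>Engines</subject></mail>"
)

def search_datatype(root, datatype, needle):
    """Return (name, value) pairs of the given datatype whose value contains `needle`."""
    hits = []
    for node in root.iter():
        if DATATYPE_OF.get(node.tag) == datatype and needle in (node.text or ""):
            hits.append((node.tag, node.text))
        for name, value in node.attrib.items():
            if DATATYPE_OF.get(name) == datatype and needle in value:
                hits.append((name, value))
    return hits

print(search_datatype(DOC, "person_name", "Babbage"))
print(search_datatype(DOC, "person_name", "Ada"))
```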

Page 40

XIRQL: Summary

Relevance ranking
Fragment/context selection
Datatypes (person_name)
Probabilistic combination of evidence

Page 41

XML Summary

Why you don’t want to use a DB
  Relevance ranking
Why you can’t use a standard IR engine
  Term statistics / indexing granularity
  Issues with fragments (granularity, coherence …)

Page 42

XML Summary: Schemas

Ideally:
  There is one schema
  User understands schema
In practice: rare
  Many schemas
  Schemas not known in advance
  Schemas change
  Users don’t understand schemas
Need to identify similar elements in different schemas
  Example: employee

Page 43

XML Summary: UI Challenges

Help user find relevant nodes in schema
  Author, editor, contributor, “from:”/sender
What is the query language you expose to user?
  XQuery? No.
  Forms? Parametric search?
  A textbox?
In general: design layer between XML and user

Page 44

Metadata

Page 45

Dublin Core Element Set

Title (e.g., Dublin Core Element Set)
Creator (e.g., Hinrich Schuetze)
Subject (e.g., keywords)
Description (e.g., an abstract)
Publisher (e.g., Stanford University)
Contributor (e.g., Chris Manning)
Date (e.g., 2002.12.03)
Type (e.g., presentation)
Format (e.g., ppt)
Identifier (e.g., http://www.stanford.edu/class/cs276a/syllabus.html)
Source (e.g., http://dublincore.org/documents/dces/)
Language (e.g., English)
Coverage (e.g., San Francisco Bay Area)
Rights (e.g., Copyright Stanford University)
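
As a small illustration, a few of these elements can be serialized as XML in the Dublin Core namespace. The field values come from the examples above; the wrapper element name `metadata` and the function `to_dc_xml` are assumptions, and this is only one of several ways to write Dublin Core.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Values taken from the examples listed above.
record = {
    "title": "Dublin Core Element Set",
    "creator": "Hinrich Schuetze",
    "publisher": "Stanford University",
    "date": "2002.12.03",
    "type": "presentation",
    "format": "ppt",
    "language": "English",
}

def to_dc_xml(fields):
    root = ET.Element("metadata")            # wrapper element name is an assumption
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = to_dc_xml(record)
print(xml_text)
```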

Page 46

Why metadata?

“Normalized” semantics
Enables searches otherwise not possible:
  Time
  Author
  Url / filename
Non-text content
  Images
  Audio
  Video

Page 47

For Effective Metadata We Need:

Semantics
  Commonly understood terms to describe information resources
Syntax
  Standard grammar for connecting terms into meaningful “sentences”
Exchange framework
  So we can recombine and exchange metadata across applications and subjects

Weibel and Miller

Page 48

Dublin Core: Goals

Metadata standard
Framework for interoperability
Facilitate development of subject specific metadata
More general goals of DC community
  Foster development of metadata tools
    Creating, editing, managing, navigating metadata
  Semantic registry for metadata
    Search declared meanings and their relationships
  Registry for metadata vocabularies
    Maybe somebody has already done the work

Page 49

RDF = Resource Description Framework

Engineering standard for metadata
W3C standard
  Part of W3C’s metadata framework
Specialized for WWW
Desiderata
  Combine different metadata modules (e.g., different subject areas)
  Syndication, aggregation, threading

Page 50

Metadata: Who creates them?

Authors
Editors
Automatic creation
Automatic context capture

Page 51

Automatic Context Capture

Page 52

Metadata Pros and Cons

CONS
  Most authors are unwilling to spend time and energy on
    learning a metadata standard
    annotating documents they author
  Authors are unable to foresee all reasons why a document may be interesting.
  Authors may be motivated to sabotage metadata (patents).
PROS
  Information retrieval often does not work.
  Words poorly approximate meaning.
  For truly valuable content, it pays to add metadata.
Synthesis
  In reality, most documents have some valuable metadata
  If metadata is available, it improves relevance and user experience
  But most interesting content will always have inconsistent and spotty metadata coverage

Page 53

Advanced Research Issues

Cross-lingual IR
Topic detection & tracking
Information filtering and collaborative filtering
Data fusion
Automatic classification
Merging database and IR technologies
Digital libraries
Multimedia information retrieval

Page 54

Resources

Jan-Marco Bremer’s publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer

www.w3.org/XML - XML resources at W3C

Ronald Bourret on native XML databases: http://www.rpbourret.com/xml/ProdsNative.htm

Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.

http://www.sciam.com/2001/0501issue/0501berners-lee.html

http://www.xml.com/pub/a/2001/01/24/rdf.html

Page 55

Additional Resources

http://www.hyperorg.com/backissues/joho-jun26-02.html#semantic

http://www.xml.com/pub/a/2001/05/23/jena.html

http://www710.univ-lyon1.fr/%7Echampin/rdf-tutorial/

http://www-106.ibm.com/developerworks/library/w-rdf/