CS276A: Text Information Retrieval, Mining, and Exploitation

Lecture 15, 26 Nov 2002

Recap: Web Anatomy

[Figure: web anatomy diagram with example URLs www.ibm.com, …/~newbie/, and /…/…/leaf.htm]

Recap: Size of the Web

• Capture-recapture technique
  - Assumes engines get independent random subsets of the Web
• E2 contains x% of E1. Assume E2 contains x% of the Web as well.
• Knowing the size of E2, compute the size of the Web:
  Size of the Web = 100 * E2 / x

[Figure: engines E1 and E2 as subsets of the WEB]

• Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98)
• Lawrence & Giles: 320 M (Dec 97)
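The capture-recapture estimate can be sketched in a few lines. This is only an illustration of the formula Size of the Web = 100 * E2 / x; the function names and the idea of measuring x from a sample of E1's pages are assumptions made for the example, not the procedure used in the cited studies.

```python
def overlap_percent(sample_of_e1, e2_index):
    """Percentage x of sampled E1 pages that are also indexed by E2."""
    hits = sum(1 for url in sample_of_e1 if url in e2_index)
    return 100.0 * hits / len(sample_of_e1)


def estimate_web_size(e2_size, x_percent):
    """Size of the Web = 100 * |E2| / x, assuming E2 covers x% of the Web."""
    if not 0 < x_percent <= 100:
        raise ValueError("x must be a percentage in (0, 100]")
    return 100.0 * e2_size / x_percent


# Hypothetical numbers: E2 indexes 100M pages and contains half of E1's sample.
x = overlap_percent(["a", "b", "c", "d"], {"a", "b", "z"})
print(estimate_web_size(100_000_000, x))
```

The estimate is only as good as the independence assumption: if both engines favor the same popular pages, x is inflated and the Web size is underestimated.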

Recent Measurements

Source: http://www.searchengineshowdown.com/stats/change.shtml

Today’s Topics

• Web IR infrastructure
• Search deployment
• XML intro
• XML indexing and search

Web IR Infrastructure

• Connectivity Server
  - Fast access to links to support link analysis
• Term Vector Database
  - Fast access to document vectors to augment link analysis

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]

• Fast web graph access to support connectivity analysis
• Stores mappings in memory from
  - URL to outlinks
  - URL to inlinks
• Applications
  - HITS, PageRank computations
  - Crawl simulation
  - Graph algorithms: web connectivity, diameter, etc.
  - Visualizations

Usage

• Input: graph algorithm + URLs + values
  - URLs to FPs to IDs
• Execution: graph algorithm runs in memory
  - IDs to URLs
• Output: URLs + values

Translation Tables on Disk
• URL text: 9 bytes/URL (compressed from ~80 bytes)
• FP(64b) -> ID(32b): 5 bytes
• ID(32b) -> FP(64b): 8 bytes
• ID(32b) -> URLs: 0.5 bytes
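The URL to fingerprint to ID translation around an in-memory graph algorithm might look like the sketch below. Everything here is an assumption made for illustration: the 64-bit fingerprint is a truncated SHA-1 (the slides do not say which fingerprint function is used), the tables are plain Python containers rather than compressed disk tables, and `TranslationTables` and `run_graph_algorithm` are hypothetical names.

```python
import hashlib


def fingerprint64(url):
    """Hypothetical 64-bit fingerprint: the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TranslationTables:
    """In-memory stand-in for the on-disk URL/FP/ID translation tables."""

    def __init__(self, urls):
        self.id_to_url = sorted(set(urls))            # dense IDs 0..n-1
        self.id_to_fp = [fingerprint64(u) for u in self.id_to_url]
        self.fp_to_id = {fp: i for i, fp in enumerate(self.id_to_fp)}

    def url_to_id(self, url):
        return self.fp_to_id[fingerprint64(url)]


def run_graph_algorithm(tables, url_values, algorithm):
    """Input: URLs + values; execution on IDs; output: URLs + values."""
    id_values = {tables.url_to_id(u): v for u, v in url_values.items()}
    result = algorithm(id_values)                     # runs purely on IDs
    return {tables.id_to_url[i]: v for i, v in result.items()}
```

The point of the design is that the graph algorithm itself never touches URL strings, so the expensive text tables are consulted only at the input and output boundaries.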

ID assignment

• Partition URLs into 3 sets, sorted lexicographically
  - High: Max degree > 254
  - Medium: 254 > Max degree > 24
  - Low: remaining (75%)
• IDs assigned in sequence (densely)

E.g., HIGH IDs: Max(indegree, outdegree) > 254

ID URL

…

9891 www.amazon.com/

9912 www.amazon.com/jobs/

…

9821878 www.geocities.com/

…

40930030 www.google.com/

…

85903590 www.yahoo.com/
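A minimal sketch of this ID assignment follows. Two details are assumptions, since the slide leaves them open: a degree of exactly 254 is placed in the medium set and exactly 24 in the low set, and the three sets are numbered in the order high, medium, low. The function name `assign_ids` is hypothetical.

```python
def assign_ids(max_degree):
    """Assign dense IDs given a dict url -> max(indegree, outdegree).

    URLs are partitioned into high / medium / low degree sets; within
    each set they are sorted lexicographically and numbered in sequence.
    """
    def tier(url):
        d = max_degree[url]
        if d > 254:
            return 0          # high
        if d > 24:
            return 1          # medium (boundary handling is an assumption)
        return 2              # low: everything else

    ordered = sorted(max_degree, key=lambda u: (tier(u), u))
    return {url: new_id for new_id, url in enumerate(ordered)}


ids = assign_ids({"a.com/": 300, "b.com/": 30, "c.com/": 2, "a.com/x": 1000})
print(ids)
```

Because URLs on the same host sort next to each other, pages that tend to link to one another end up with nearby IDs, which is what makes the delta encoding of adjacency lists effective.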

Adjacency lists

• In-memory tables for outlinks, inlinks
• List index maps from an ID to the start of its adjacency list

Adjacency List Compression - I

[Figure: a list index for IDs 104, 105, 106 pointing into a sequence of adjacency lists, shown first with raw target IDs and then with delta-encoded adjacency lists]

• Adjacency List:
  - Smaller delta values are exponentially more frequent (80% to same host)
  - Compress deltas with variable length encoding (e.g., Huffman)

• List Index pointers: 32b for high, Base+16b for med, Base+8b for low
  - Avg = 12b per pointer
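To make the delta idea concrete, here is a small sketch. It assumes the first entry of a list is stored relative to the source ID and each later entry relative to its predecessor. The slide suggests Huffman coding for the deltas; this sketch swaps in a simpler variable-byte code with zigzag mapping for negative values, purely to show that small deltas cost few bits. The IDs in the usage line are illustrative.

```python
def delta_encode(source_id, targets):
    """First delta is relative to the source ID, the rest to the previous target."""
    deltas, prev = [], source_id
    for t in sorted(targets):
        deltas.append(t - prev)
        prev = t
    return deltas


def delta_decode(source_id, deltas):
    targets, prev = [], source_id
    for d in deltas:
        prev += d
        targets.append(prev)
    return targets


def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def varbyte_encode(values):
    """7 payload bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    for v in values:
        z = zigzag(v)
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
    return bytes(out)


def varbyte_decode(data):
    values, z, shift = [], 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(unzigzag(z))
            z, shift = 0, 0
    return values


deltas = delta_encode(104, [98, 132, 153])
packed = varbyte_encode(deltas)
print(deltas, len(packed))
```

With most links pointing to the same host, most deltas fit in a single byte here; an entropy code such as Huffman would push the average below that.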

List Index Pointers

[Figure: ID to adjacency list lookup. Labels: URL Info; LC:TID; FRQ:RL; Base (4 bytes); offsets for 16 IDs; offset; ID; adjacency lists.]
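The base-plus-offset layout of the list index can be sketched as below. The block size of 16 IDs and the 4-byte base come from the figure labels; the function names, and the use of Python lists instead of packed 8-bit or 16-bit offset fields, are assumptions for illustration.

```python
BLOCK = 16  # number of IDs that share one base pointer


def build_list_index(list_starts):
    """Split absolute start positions into per-block bases and small offsets.

    list_starts[i] is where the adjacency list of ID i begins; the values
    must be non-decreasing so that every offset is non-negative.
    """
    bases, offsets = [], []
    for i, start in enumerate(list_starts):
        if i % BLOCK == 0:
            bases.append(start)          # one full-width base per block
        offsets.append(start - bases[-1])  # small number, fits in few bits
    return bases, offsets


def lookup(bases, offsets, node_id):
    """Start of the adjacency list for node_id."""
    return bases[node_id // BLOCK] + offsets[node_id]
```

Low-degree pages have short lists, so the offsets within a block stay small; that is why a narrower offset field suffices for the low set than for the medium set.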

Adjacency List Compression - II

• Inter List Compression
  - Basis: Similar URLs may share links
  - Close in ID space => adjacency lists may overlap
• Approach
  - Define a representative adjacency list for a block of IDs
    (adjacency list of a reference ID, or union of adjacency lists in the block)
  - Represent adjacency list in terms of deletions and additions when it is cheaper to do so
• Measurements
  - Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
  - Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM)
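The "deletions and additions" representation can be sketched as follows. The cost model is an assumption: the sketch simply counts entries and uses the difference form when it has fewer entries than the plain list, whereas a real implementation would compare encoded bit sizes. The function names are hypothetical.

```python
def encode_with_reference(adjacency, reference):
    """Encode a list as edits against a representative list when that is cheaper."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)    # in the reference but not in this list
    additions = sorted(adj - ref)    # in this list but not in the reference
    if len(deletions) + len(additions) < len(adj):
        return ("diff", deletions, additions)
    return ("plain", sorted(adj))


def decode_with_reference(encoded, reference):
    if encoded[0] == "plain":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))


reference = [1, 2, 3, 4, 5]
print(encode_with_reference([1, 2, 3, 4, 6], reference))
```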

Term Vector Database [Stat00]

• Fast access to 50-word term vectors for web pages
• Term Selection:
  - Restricted to middle 1/3rd of lexicon by document frequency
  - Top 50 words in document by TF.IDF
• Term Weighting:
  - Deferred till run-time (can be based on term freq, doc freq, doc length)
• Applications
  - Content + Connectivity analysis (e.g., Topic Distillation)
  - Topic specific crawls
  - Document classification
• Performance
  - Storage: 33GB for 272M term vectors
  - Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
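The term selection step might be sketched like this. Assumptions: "middle third" is taken over the lexicon sorted by document frequency, TF.IDF is computed as tf * log(N / df) only to pick the terms, and the stored vector keeps raw frequencies because weighting is deferred to run-time. Function names and the toy lexicon are hypothetical.

```python
import math
from collections import Counter


def middle_third_lexicon(doc_freq):
    """Terms in the middle third of the lexicon when sorted by document frequency."""
    terms = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(terms)
    return set(terms[n // 3: 2 * n // 3])


def term_vector(tokens, doc_freq, num_docs, allowed, k=50):
    """Top-k allowed terms of one document by tf * idf, stored with raw tf."""
    tf = Counter(t for t in tokens if t in allowed)

    def score(term):
        return tf[term] * math.log(num_docs / doc_freq[term])

    top = sorted(tf, key=lambda t: (-score(t), t))[:k]
    return {t: tf[t] for t in top}


df = {"the": 100, "a": 90, "web": 40, "graph": 30, "hits": 2, "zyx": 1}
allowed = middle_third_lexicon(df)
print(term_vector(["the", "web", "web", "graph", "zyx"], df, 100, allowed))
```

Dropping the most frequent third removes stopword-like terms, and dropping the rarest third removes typos and noise, which keeps the fixed-size vectors informative.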

Architecture

[Figure: URLid to term vector lookup. Labels: URL Info; LC:TID entries (terms); FRQ:RL entries (freq); 128-byte TV record; Base (4 bytes); bit vector for 480 URLids; offset; URLid * 64 / 480.]

Search Deployment

• Web IR is just one (very specific) type of IR
• Commercially most important IR application: enterprise search (large corporations)
  - Problem different from Web IR
• Peer-to-peer (P2P) search
  - Another search deployment strategy

Enterprise Search Deployment

[Figure: enterprise search deployment. Labels: Database; Corporate Network; Company Web Site; E-Commerce; Web Portals; Enterprises; Proprietary content; Public content; World Wide Web; Sources; Markets; Search Boxes; Content Location; Content Management; Groupware.]

Evolution of Enterprise Search

1st Generation: Classic Information Retrieval (1985 - 1993)
  User: Trained specialist
  Scope: Small, closed collections
  Technology: Pattern/string matching

2nd Generation: Driven by WWW (1994 - 1999)
  User: Everyone
  Scope: Intranet/Extranet
  Technology: Pattern/string matching and external factors for relevance ranking + categorization

3rd Generation: Discovery (Text Mining) (2000+)
  User: Everyone and software agents
  Scope: Structured, semi-structured and unstructured information
  Technology: Introduction of linguistic and semantic processing

Enterprise IR is a lot more than search …

• Security
  - Cannot search what you should not read
• Content organization & creation
  - Automatic classification
  - Taxonomy generation
  - Support for multiple languages, multiple formats
• Conduits into databases and other content management
  - Homes for “valuable” content
• Information processing tools
  - Annotation
  - Range searches
  - Custom ranking criteria
  - Cross-lingual tools, …
• Individual preferences
  - Personalization
  - Notification, …

Peer-To-Peer (P2P) Search

• No central index
• Each node in a network builds and maintains its own index
• Each node has “servent” software
  - On booting, servent pings ~4 other hosts
  - Connects to those that respond
  - Initiates, propagates and serves requests

Which hosts to connect to?

• The ones you connected to last time
• Random hosts you know of
• Request suggestions from central (or hierarchical) nameservers
• All govern system’s shape and efficiency

Serving P2P search requests

• Send your request to your neighbors
• They send it to their neighbors
  - Decrement “time to live” for query
  - Query dies when ttl = 0
• Send search matches back along requesting path
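The flooding scheme above can be simulated with a breadth-first traversal. This is a sketch under assumptions: the network is a hypothetical dict of neighbor lists, each node drops queries it has already seen (the slide does not mention duplicate suppression), and a match is reported as the path it would travel back to the requester.

```python
from collections import deque


def flood_query(neighbors, has_match, origin, ttl):
    """Flood a query from origin; return one return-path per matching node."""
    seen = {origin}
    came_from = {}                      # node -> neighbor it got the query from
    queue = deque([(origin, ttl)])
    hits = []
    while queue:
        node, remaining = queue.popleft()
        if node != origin and has_match(node):
            path = [node]               # matches travel back along the request path
            while path[-1] != origin:
                path.append(came_from[path[-1]])
            hits.append(path)
        if remaining == 0:
            continue                    # query dies when ttl = 0
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                came_from[nxt] = node
                queue.append((nxt, remaining - 1))   # decrement time to live
    return hits


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(flood_query(chain, lambda n: n in {"C", "D"}, "A", ttl=2))
```

With ttl=2 the query reaches C but never D, which illustrates why results in such networks depend on where the requester sits in the topology.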

Some P2P Networks

• Gnutella
• Kazaa
• Bearshare
• Aimster
• Grokster
• Morpheus

P2P: Information Retrieval Issues

Why is this more difficult than centralized IR?

• Selection of nodes to query
• Merging of results
• Spam

What is XML?

• eXtensible Markup Language
• A framework for defining markup languages
• No fixed collection of markup tags
• Each XML language is targeted at a particular application
• All XML languages share features
• Enables building of generic tools

Basic Structure

• An XML document is an ordered, labeled tree
• Character data leaf nodes contain the actual data (text strings)
  - Data nodes must be non-empty and non-adjacent to other character data nodes
• Element nodes are each labeled with
  - a name (often called the element type), and
  - a set of attributes, each consisting of a name and a value,
  and can have child nodes

XML Example

<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>

Elements

• Elements are denoted by markup tags: <foo attr1="value" … > thetext </foo>
  - Element start tag: foo
  - Attribute: attr1
  - The character data: thetext
  - Matching element end tag: </foo>
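To see the ordered, labeled tree behind the FileCab chapter example, one can parse it with Python's standard `xml.etree.ElementTree`. Note one modelling difference that is specific to this library, not to XML: ElementTree has no separate character data nodes, so text is attached to elements as `text` (before the first child) and `tail` (after the element's end tag). The `outline` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<chapter id="cmds"><chaptitle>FileCab</chaptitle>'
    "<para>This chapter describes the commands that manage the "
    "<tm>FileCab</tm>inet application.</para></chapter>"
)


def outline(elem, depth=0):
    """One line per element node: name, attributes, and attached character data."""
    lines = ["  " * depth + f"{elem.tag} {elem.attrib} text={elem.text!r} tail={elem.tail!r}"]
    for child in elem:                   # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The `tm` element shows why order matters: its tail, "inet application.", only makes sense when read directly after the element's own text.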

XML vs HTML

Relationship?

• HTML is a markup language for a specific purpose (display in browsers)
• XML is a framework for defining markup languages
• HTML can be formalized as an XML language (XHTML)
• XML defines logical structure only
• HTML: same intention, but has evolved into a presentation language

XML: Design Goals

Separate syntax from semantics to provide a common framework for structuring information

Allow tailor-made markup for any imaginable application domain

Support internationalization (Unicode) and platform independence

Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML?

Represent semi-structured data (data that are structured, but don’t fit relational model)

• XML is more flexible than DBs
• XML is more structured than simple IR
• You get a massive infrastructure for free

Applications of XML
• XHTML
• CML – chemical markup language
• WML – wireless markup language
• ThML – theological markup language

<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3>
<p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge
  <note place="foot" id="One.2.p0.4">
    <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6">
      <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1.
    </added></p>
  </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars.
  <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9">
    <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p>
  </note></added>
</p>

XML Schemas

Schema = syntax definition of XML language

Schema language = formal language for expressing XML schemas

• Examples
  - DTD
  - XML Schema (W3C)
• Relevance for XML IR
  - Our job is much easier if we have a (one) schema

XML Tutorial

• http://www.brics.dk/~amoeller/XML/index.html (Anders Møller and Michael Schwartzbach)
• Previous (and some following) slides are based on their tutorial



XML Indexing and Search

Native XML Database

• Uses XML document as logical unit
• Should support
  - Elements
  - Attributes
  - PCDATA (parsed character data)
  - Document order
• Contrast with
  - DB modified for XML
  - Generic IR system modified for XML

XML Indexing and Search

• Most native XML databases have taken a DB approach
  - Exact match
  - Evaluate path expressions
  - No IR-type relevance ranking
• Only a few focus on relevance ranking

Timber: XML as DB extension

• DB: search tuples
• Timber: search trees
• Main focus
  - Complex and variable structure of trees (vs. tuples)
  - Ordering
  - XML query optimization vs. relational optimization

ToXin

• Native XML database
• Exploits overall path structure
• Supports any general path query
• Query evaluation in three stages
  - Preselection stage
  - Selection stage
  - Postselection stage

ToXin: Motivation

• Strawman: index all paths occurring in database
• Does not allow backward navigation
• Example query: find all the titles of articles from 1990

Query Evaluation Stages

• Pre-selection
  - First navigation down the tree
• Selection
  - Value selection according to filter
• Post-selection
  - Navigation up and down again
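The three stages can be traced on the example query "find all the titles of articles from 1990". The sketch below only mirrors the order of navigation using the standard `xml.etree.ElementTree`; it does not reproduce ToXin's index structures. The document layout (`bib`, `article`, `title`, `year`) and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <article><title>A</title><year>1990</year></article>
  <article><title>B</title><year>1995</year></article>
  <book><title>C</title><year>1990</year></book>
</bib>"""


def titles_of_articles_from(root, year):
    # ElementTree has no parent pointers, so build a child -> parent map
    # to support the backward (upward) navigation step.
    parent = {child: p for p in root.iter() for child in p}

    # Pre-selection: navigate down the tree to the candidate value nodes.
    candidates = root.findall("./article/year")

    # Selection: keep the nodes whose value satisfies the filter.
    selected = [y for y in candidates if (y.text or "").strip() == year]

    # Post-selection: navigate up to the article, then down to its title.
    return [t.text for y in selected for t in parent[y].findall("title")]


print(titles_of_articles_from(ET.fromstring(BIB), "1990"))
```

The upward step in post-selection is exactly what a pure root-to-leaf path index cannot answer, which is the limitation of the strawman.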

Factors Impacting Performance

• Data source specific
  - Document size
  - Number of XML nodes and values
  - Path complexity (degree of nesting)
  - Average value size
• Query specific
  - Selectiveness of path constraint
  - Size of query answer
  - Number of elements selected by filter

Benchmark Parameters

Query Classification

Evaluation

ToXin: Summary

• Efficient native XML database
• All paths are indexed (not just from root)
• Path index linear in corpus size
• Shortcomings
  - Order of nodes ignored
  - Semantics of IDRefs ignored

What is missing?

IR/Relevance Ranking for XML

Why is this difficult?

IR XML Challenge 1: Term Statistics

• There is no document unit in XML
• How do we compute tf and idf?
• Global tf/idf over all text context is useless
• Indexing granularity
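A small experiment shows why the choice of unit matters for term statistics. The sketch treats every element with a given tag as a "document" and computes idf as log(N / df); the toy document, the idf formula, and the function name are assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET

BOOK = (
    "<book>"
    "<chapter><para>tree frog</para><para>pond</para></chapter>"
    "<chapter><para>frog song</para><para>reed</para></chapter>"
    "</book>"
)


def idf_at_granularity(root, unit_tag, term):
    """idf of term when every <unit_tag> element counts as one document."""
    units = list(root.iter(unit_tag))
    df = sum(
        1 for u in units
        if term in " ".join(u.itertext()).lower().split()
    )
    if df == 0:
        return None
    return math.log(len(units) / df)


root = ET.fromstring(BOOK)
for tag in ("book", "chapter", "para"):
    print(tag, idf_at_granularity(root, tag, "frog"))
```

Here "frog" looks uninformative at book and chapter granularity, since it occurs in every unit, but becomes discriminating once paragraphs are the unit.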

IR XML Challenge 2: Fragments

• IR systems don’t store content (only index)
• Need to go to document for displaying fragment
• Easier in DB framework

Relevance Ranking for XML

Will revisit next week

Querying XML

• Semistructured queries
• XPath
• XQuery

Types of (Semi)Structured Queries

• Location/position (“chapter no. 3”)
• Simple attribute/value
  - /play/title contains "hamlet"
• Path queries
  - title contains "hamlet"
  - /play//title contains "hamlet"
• Complex graphs
  - Employees with two managers
• All of the above: mixed structure/content
• Subsumes: hyperlinks

XPath

• Declarative language for
  - Addressing (used in XLink/XPointer and in XSLT)
  - Pattern matching (used in XSLT and in XQuery)
• Location path
  - a sequence of location steps separated by /
• Example:
  child::section[position()<6] / descendant::cite / attribute::href

Axes in XPath

ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

Location steps

A single location step has the form: axis :: node-test [ predicate ]

The axis selects a rough set of candidate nodes (e.g. the child nodes of the context node).

The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.), or names (e.g. element name).

The predicates (zero or more) cause a further, potentially more complex, filtration

child::section[position()<6]
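The location path child::section[position()<6] / descendant::cite / attribute::href can be unrolled step by step. Python's standard `xml.etree.ElementTree` supports only a limited XPath subset (no explicit axes and no position() comparisons), so this sketch evaluates each step by hand: axis, then node-test, then predicate. The function name and the test document are hypothetical.

```python
import xml.etree.ElementTree as ET


def hrefs_of_cites_in_first_sections(context, limit=6):
    """Hand evaluation of child::section[position()<limit]/descendant::cite/attribute::href."""
    results = []
    # Step 1: axis child, node-test "section".
    sections = [c for c in context if c.tag == "section"]
    for position, section in enumerate(sections, start=1):
        # Predicate: position() counts within the nodes that passed the node-test.
        if not position < limit:
            continue
        # Step 2: axis descendant, node-test "cite" (any depth below the section).
        for cite in section.iter("cite"):
            # Step 3: axis attribute, node-test "href".
            href = cite.get("href")
            if href is not None:
                results.append(href)
    return results


doc = ET.fromstring(
    "<doc>"
    + "".join(f'<section><p><cite href="s{i}"/></p></section>' for i in range(1, 7))
    + "</doc>"
)
print(hrefs_of_cites_in_first_sections(doc))
```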

XQuery

• SQL for XML
• Usage scenarios
  - Human-readable documents
  - Data-oriented documents
  - Mixed documents (e.g., patient records)
• Relies on
  - XPath
  - XML Schema datatypes
• Turing complete
• XQuery is still a working draft
  - More than a hundred open issues as of 2002.11.10

XQuery

The principal forms of XQuery expressions are:
• path expressions
• element constructors
• FLWR ("flower") expressions
• list expressions
• conditional expressions
• quantified expressions
• datatype expressions

Evaluated with respect to a context

FLWR

FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p

FOR generates an ordered list of bindings of publisher names to $p

LET associates to each binding a further binding of the list of book elements with that publisher to $b

at this stage, we have an ordered list of tuples of bindings: ($p,$b)

WHERE filters that list to retain only the desired tuples

RETURN constructs for each tuple a resulting value
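The FOR / LET / WHERE / RETURN pipeline described above can be imitated in plain Python to make the tuple stream explicit. This is a sketch of the semantics, not an XQuery engine: it uses the standard `xml.etree.ElementTree`, compares publishers by their text value, and makes the threshold a parameter so that a small hypothetical document can be used. As in the query as written, every publisher element is bound in turn, so a publisher with many books is returned once per occurrence.

```python
import xml.etree.ElementTree as ET


def publishers_with_many_books(root, threshold=100):
    results = []
    # FOR: an ordered list of bindings of publisher elements to p.
    for p in root.iter("publisher"):
        # LET: bind b to the list of book elements with that publisher.
        b = [
            book for book in root.iter("book")
            if any(q.text == p.text for q in book.findall("publisher"))
        ]
        # WHERE: keep only the (p, b) tuples that satisfy the condition.
        if len(b) > threshold:
            # RETURN: construct a result value for each surviving tuple.
            results.append(p)
    return results


bib = ET.fromstring(
    "<bib>"
    + "<book><publisher>A</publisher></book>" * 3
    + "<book><publisher>B</publisher></book>"
    + "</bib>"
)
print([p.text for p in publishers_with_many_books(bib, threshold=2)])
```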

XQuery vs SQL

• Order matters!
  document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]
• XQuery is Turing complete, SQL is not.

XQuery Example

Møller and Schwartzbach

XQuery Standard on Ranking (2.3.1)

Document order defines a total ordering among all the nodes seen by the language processor. Within a given document, the document node is the first node, followed by element nodes, text nodes, comment nodes, and processing instruction nodes in the order of their representation in the XML form of the document (after expansion of entities). Element nodes occur before their children. The namespace nodes of an element immediately follow the element node, in implementation-defined order. The attribute nodes of an element immediately follow its namespace nodes, and are also in implementation-defined order.

The relative order of nodes in distinct documents is implementation-defined but stable within a given query or transformation. In other words, given two distinct documents A and B, if a node in document A is before a node in document B, then every node in document A is before every node in document B. The relative order among free-floating nodes (those not in a document) is implementation-defined.

Next Week (12/3)

• XML indexing and search II
• Metadata indexing and search
  - Dublin Core, RDF, DAML+OIL
