CS276A: Text Information Retrieval, Mining, and Exploitation
Lecture 15, 26 Nov 2002
Recap: Web Anatomy
[Figure: web anatomy diagram with example URLs www.ibm.com, …/~newbie/ and /…/…/leaf.htm]
Recap: Size of the Web
• Capture-recapture technique
- Assumes engines get independent random subsets of the Web
• E2 contains x% of E1. Assume E2 contains x% of the Web as well.
• Knowing the size of E2, compute the size of the Web: Size of the Web = 100 * E2 / x
[Figure: overlapping sets E1 and E2 inside the WEB]
• Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98)
• Lawrence & Giles: 320 M (Dec 97)
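The capture-recapture estimate can be sketched in a few lines. This is only an illustration of the formula Size of the Web = 100 * E2 / x; the function names and the sampling helper are hypothetical and not part of the cited studies.

```python
def overlap_percentage(sample_of_e1, e2_index):
    """Estimate x: the share (in percent) of sampled E1 pages that E2 also holds."""
    sample = list(sample_of_e1)
    hits = sum(1 for url in sample if url in e2_index)
    return 100.0 * hits / len(sample)


def estimate_web_size(size_e2, overlap_percent):
    """Size of the Web = 100 * |E2| / x, assuming E2 covers x% of the Web."""
    if not 0 < overlap_percent <= 100:
        raise ValueError("overlap_percent must be in (0, 100]")
    return 100.0 * size_e2 / overlap_percent
```

For instance, if E2 indexes 100 M pages and holds half of a random sample drawn from E1, the estimate is 200 M pages. The estimate is only as good as the independence assumption stated above.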
Recent Measurements
Source: http://www.searchengineshowdown.com/stats/change.shtml
Today’s Topics
• Web IR infrastructure
• Search deployment
• XML intro
• XML indexing and search
Web IR Infrastructure
• Connectivity Server
- Fast access to links to support link analysis
• Term Vector Database
- Fast access to document vectors to augment link analysis
Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
• Fast web graph access to support connectivity analysis
• Stores mappings in memory from URL to outlinks, URL to inlinks
• Applications
- HITS, PageRank computations
- Crawl simulation
- Graph algorithms: web connectivity, diameter, etc.
- Visualizations
Usage
[Figure: processing pipeline. Input: graph algorithm + URLs + values. URLs are translated to FPs and then to IDs. Execution: the graph algorithm runs in memory. IDs are translated back to URLs. Output: URLs + values.]
Translation Tables on Disk
• URL text: 9 bytes/URL (compressed from ~80 bytes)
• FP (64b) -> ID (32b): 5 bytes
• ID (32b) -> FP (64b): 8 bytes
• ID (32b) -> URLs: 0.5 bytes
ID assignment
• Partition URLs into 3 sets, sorted lexicographically
- High: max degree > 254
- Medium: 254 > max degree > 24
- Low: remaining (75%)
• IDs assigned in sequence (densely)
E.g., HIGH IDs:
Max(indegree , outdegree) > 254
ID URL
…
9891 www.amazon.com/
9912 www.amazon.com/jobs/
…
9821878 www.geocities.com/
…
40930030 www.google.com/
…
85903590 www.yahoo.com/
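A minimal sketch of the ID assignment described above. Two points are assumptions not stated on the slide: the sets are numbered in the order high, medium, low, and a max degree of exactly 254 is placed in the medium set. The function name assign_ids is hypothetical.

```python
def assign_ids(degrees):
    """Map each URL to a dense integer ID.

    degrees: dict of url -> (indegree, outdegree).
    URLs are partitioned by max(indegree, outdegree), each partition is
    sorted lexicographically, and IDs are handed out in sequence.
    """
    def bucket(url):
        max_degree = max(degrees[url])
        if max_degree > 254:
            return 0  # high
        if max_degree > 24:
            return 1  # medium (assumption: 254 itself lands here)
        return 2      # low

    ordered = sorted(degrees, key=lambda url: (bucket(url), url))
    return {url: new_id for new_id, url in enumerate(ordered)}
```

Because URLs on the same host sort next to each other within a partition, they receive nearby IDs, which is what makes the delta values in the adjacency lists small.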
Adjacency lists
• In-memory tables for outlinks, inlinks
• List index maps from an ID to start of adjacency list
Adjacency List Compression - I
[Figure: a list index with entries for IDs 104, 105, 106 pointing into a sequence of adjacency lists (98 132 153 98 147 153), and the same list index pointing into delta-encoded adjacency lists (-6 34 21 -8 49 6)]
• Adjacency list:
- Smaller delta values are exponentially more frequent (80% to same host)
- Compress deltas with variable-length encoding (e.g., Huffman)
• List index pointers: 32b for high, Base+16b for med, Base+8b for low
- Avg = 12b per pointer
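The delta encoding can be sketched as follows. The first delta is taken relative to the source ID and later deltas relative to the previous target. The slide names Huffman coding for the deltas; this sketch swaps in a simpler zigzag plus 7-bit variable-length byte code purely for illustration, and all function names are hypothetical.

```python
def delta_encode(source_id, adjacency):
    """Turn a list of target IDs into gaps, starting from the source ID."""
    deltas, prev = [], source_id
    for target in sorted(adjacency):
        deltas.append(target - prev)
        prev = target
    return deltas


def delta_decode(source_id, deltas):
    """Inverse of delta_encode."""
    targets, prev = [], source_id
    for delta in deltas:
        prev += delta
        targets.append(prev)
    return targets


def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress(deltas):
    """Concatenate the variable-length codes of all deltas."""
    return b"".join(varint(zigzag(d)) for d in deltas)
```

With source ID 104 and targets 98, 132, 153, the deltas are -6, 34, 21, and each fits in one byte under this code.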
List Index Pointers
[Figure: ID to adjacency list lookup. An ID selects an index entry holding a base (4 bytes) and offsets for 16 IDs; the offset points into the adjacency lists. The extracted figure also carries the labels URL Info, LC:TID and FRQ:RL.]
Adjacency List Compression - II
• Inter-list compression
- Basis: similar URLs may share links
- Close in ID space => adjacency lists may overlap
• Approach
- Define a representative adjacency list for a block of IDs (adjacency list of a reference ID; union of adjacency lists in the block)
- Represent adjacency list in terms of deletions and additions when it is cheaper to do so
• Measurements
- Intra list + starts: 8-11 bits per link (580M pages/16GB RAM)
- Inter list: 5.4-5.7 bits per link (870M pages/16GB RAM)
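A rough sketch of the inter-list idea: express an adjacency list as deletions from and additions to a representative list, and fall back to the raw list when that is not cheaper. The cost model here (number of stored entries) is an assumption made for illustration; the real system measures cost in bits. Function names are hypothetical.

```python
def encode_against_reference(adjacency, reference):
    """Return ("ref", deletions, additions) if cheaper, else ("raw", list)."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)   # in the reference but not in this list
    additions = sorted(adj - ref)   # in this list but not in the reference
    if len(deletions) + len(additions) < len(adj):
        return ("ref", deletions, additions)
    return ("raw", sorted(adj))


def decode_against_reference(encoded, reference):
    """Rebuild the adjacency list from either encoding."""
    if encoded[0] == "raw":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))
```

When the list overlaps heavily with the reference, only the few differences are stored; when it shares nothing with the reference, the raw form wins.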
Term Vector Database [Stat00]
• Fast access to 50-word term vectors for web pages
• Term selection:
- Restricted to middle 1/3rd of lexicon by document frequency
- Top 50 words in document by TF.IDF
• Term weighting:
- Deferred till run-time (can be based on term freq, doc freq, doc length)
• Applications
- Content + connectivity analysis (e.g., topic distillation)
- Topic-specific crawls
- Document classification
• Performance
- Storage: 33GB for 272M term vectors
- Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
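The term selection step might look like the sketch below. It assumes "middle 1/3rd of lexicon" means ranking all terms by document frequency and dropping the rarest third and the most frequent third, and it uses a plain tf * log(N/df) score; both are interpretations, not details taken from [Stat00]. Raw term frequencies are stored because weighting is deferred to run-time.

```python
import math
from collections import Counter


def middle_third_terms(doc_freq):
    """Terms left after dropping the lowest and highest third by document frequency."""
    ranked = sorted(doc_freq, key=lambda term: (doc_freq[term], term))
    third = len(ranked) // 3
    return set(ranked[third:len(ranked) - third])


def term_vector(tokens, doc_freq, num_docs, size=50):
    """Pick the top `size` allowed terms by TF.IDF; return term -> raw frequency."""
    allowed = middle_third_terms(doc_freq)
    tf = Counter(t for t in tokens if t in allowed)
    score = {t: f * math.log(num_docs / doc_freq[t]) for t, f in tf.items()}
    top = sorted(score, key=lambda t: (-score[t], t))[:size]
    return {t: tf[t] for t in top}
```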
Architecture
[Figure: URLid to term vector lookup. Labels: Base (4 bytes), bit vector for 480 URLids, offset, URLid * 64 / 480, and a 128-byte TV record containing URL Info, LC:TID entries (Terms) and FRQ:RL entries (Freq).]
Search Deployment
• Web IR is just one (very specific) type of IR
• Commercially most important IR application: enterprise search (large corporations)
- Problem different from Web IR
• Peer-to-peer (P2P) search
- Another search deployment strategy
Enterprise Search Deployment
[Figure: enterprise search deployment. Labels: Database, Corporate Network, Company Web Site, E-Commerce, Web Portals, Enterprises, World Wide Web; Proprietary content, Public content; Sources, Markets, Search Boxes, Content Location, Content Management, Groupware.]
Evolution of Enterprise Search
           | 1st Generation: Classic Information Retrieval | 2nd Generation: Driven by WWW | 3rd Generation: Discovery (Text Mining)
Years      | 1985 - 1993 | 1994 - 1999 | 2000+
User       | Trained specialist | Everyone | Everyone and software agents
Scope      | Small, closed collections | Intranet/Extranet | Structured, semi-structured and unstructured information
Technology | Pattern/string matching | Pattern/string matching and external factors for relevance ranking + categorization | Introduction of linguistic and semantic processing
Enterprise IR is a lot more than search …
• Security
- Cannot search what you should not read
• Content organization & creation
- Automatic classification
- Taxonomy generation
- Support for multiple languages, multiple formats
• Conduits into databases and other content management: homes for “valuable” content
• Information processing tools
- Annotation
- Range searches
- Custom ranking criteria
- Cross lingual tools, …
• Individual preferences
- Personalization
- Notification, …
Peer-To-Peer (P2P) Search
• No central index
• Each node in a network builds and maintains own index
• Each node has “servent” software
- On booting, servent pings ~4 other hosts
- Connects to those that respond
- Initiates, propagates and serves requests
Which hosts to connect to?
• The ones you connected to last time
• Random hosts you know of
• Request suggestions from central (or hierarchical) nameservers
• All govern system’s shape and efficiency
Serving P2P search requests
• Send your request to your neighbors
• They send it to their neighbors
- decrement “time to live” for query
- query dies when ttl = 0
• Send search matches back along requesting path
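The flooding scheme above can be simulated on a small in-memory graph. This is a sketch, not the protocol of any particular network: the data structures, the function name, and the suppression of duplicate deliveries (each node handles a query once) are assumptions.

```python
from collections import deque


def flood_query(network, indexes, start, term, ttl):
    """Flood a query from `start` and collect matches.

    network: node -> list of neighbours
    indexes: node -> set of terms that node's local index holds
    Returns node -> path along which its match travels back to `start`.
    """
    hits = {}
    seen = {start}
    queue = deque([(start, ttl, [start])])
    while queue:
        node, remaining, path = queue.popleft()
        if node != start and term in indexes.get(node, ()):
            hits[node] = list(reversed(path))  # reply retraces the request path
        if remaining == 0:
            continue  # query dies when ttl = 0
        for neighbour in network.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1, path + [neighbour]))
    return hits
```

On a chain a-b-c-d where c and d both hold the term, a query from a with ttl 2 reaches c but never d, which shows how ttl bounds both the traffic and the recall.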
Some P2P Networks
• Gnutella
• Kazaa
• Bearshare
• Aimster
• Grokster
• Morpheus
P2P: Information Retrieval Issues
Why is this more difficult than centralized IR?
• Selection of nodes to query
• Merging of results
• Spam
What is XML?
• eXtensible Markup Language
• A framework for defining markup languages
• No fixed collection of markup tags
• Each XML language targeted for application
• All XML languages share features
• Enables building of generic tools
Basic Structure
• An XML document is an ordered, labeled tree
• Character data leaf nodes contain the actual data (text strings)
- Data nodes must be non-empty and non-adjacent to other character data nodes
• Element nodes are each labeled with
- a name (often called the element type), and
- a set of attributes, each consisting of a name and a value
• Element nodes can have child nodes
XML Example
<chapter id="cmds">
  <chaptitle>FileCab</chaptitle>
  <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para>
</chapter>
Elements
• Elements are denoted by markup tags: <foo attr1="value" … > thetext </foo>
- Element start tag: foo
- Attribute: attr1
- The character data: thetext
- Matching element end tag: </foo>
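To make the ordered, labeled tree concrete, the chapter example can be parsed with Python's standard xml.etree.ElementTree module. The walk helper is a hypothetical name; it lists element nodes and character data nodes in document order. ElementTree stores character data that follows a child element in that child's tail attribute, which the helper accounts for.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<chapter id="cmds"><chaptitle>FileCab</chaptitle>'
    "<para>This chapter describes the commands that manage the "
    "<tm>FileCab</tm>inet application.</para></chapter>"
)


def walk(element, depth=0):
    """Yield (depth, kind, value) for element and text nodes in document order."""
    yield depth, "element", element.tag
    if element.text and element.text.strip():
        yield depth + 1, "text", element.text
    for child in element:
        yield from walk(child, depth + 1)
        if child.tail and child.tail.strip():
            yield depth + 1, "text", child.tail  # text after the child's end tag


root = ET.fromstring(DOC)
nodes = list(walk(root))
for depth, kind, value in nodes:
    print("  " * depth + f"{kind}: {value!r}")
```

The para element ends up with three children in order: a text node, the tm element, and a second text node, which is the mixed content that makes XML harder to handle than flat records.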
XML vs HTML
Relationship?
• HTML is a markup language for a specific purpose (display in browsers)
• XML is a framework for defining markup languages
• HTML can be formalized as an XML language (XHTML)
• XML defines logical structure only
• HTML: same intention, but has evolved into a presentation language
XML: Design Goals
Separate syntax from semantics to provide a common framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the future of (semi)structured information (do some of the work now done by databases)
Why Use XML?
• Represent semi-structured data (data that are structured, but don’t fit relational model)
• XML is more flexible than DBs
• XML is more structured than simple IR
• You get a massive infrastructure for free
Applications of XML
• XHTML
• CML – chemical markup language
• WML – wireless markup language
• ThML – theological markup language
<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3> <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>
XML Schemas
• Schema = syntax definition of XML language
• Schema language = formal language for expressing XML schemas
• Examples
- DTD
- XML Schema (W3C)
• Relevance for XML IR
- Our job is much easier if we have a (one) schema
XML Tutorial
• http://www.brics.dk/~amoeller/XML/index.html (Anders Møller and Michael Schwartzbach)
• Previous (and some following) slides are based on their tutorial
XML Indexing and Search
Native XML Database
• Uses XML document as logical unit
• Should support
- Elements
- Attributes
- PCDATA (parsed character data)
- Document order
• Contrast with
- DB modified for XML
- Generic IR system modified for XML
XML Indexing and Search
• Most native XML databases have taken a DB approach
- Exact match
- Evaluate path expressions
- No IR-type relevance ranking
• Only a few focus on relevance ranking
Timber: XML as DB extension
• DB: search tuples
• Timber: search trees
• Main focus
- Complex and variable structure of trees (vs. tuples)
- Ordering
- XML query optimization vs relational optimization
ToXin
• Native XML database
• Exploits overall path structure
• Supports any general path query
• Query evaluation in three stages
- Preselection stage
- Selection stage
- Postselection stage
ToXin: Motivation
• Strawman: index all paths occurring in database
• Does not allow backward navigation
• Example query: find all the titles of articles from 1990
Query Evaluation Stages
• Pre-selection
- First navigation down the tree
• Selection
- Value selection according to filter
• Post-selection
- Navigation up and down again
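The three stages can be walked through on the earlier example query, "find all the titles of articles from 1990". The sketch below runs directly on a parsed tree with Python's xml.etree.ElementTree and a parent map; it illustrates the stages only and is not ToXin's path index. The sample document and function name are made up for the example.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <article><title>A</title><year>1990</year></article>
  <article><title>B</title><year>1995</year></article>
  <article><title>C</title><year>1990</year></article>
</bib>"""


def titles_from_year(root, year):
    """Return titles of articles whose year element equals `year`."""
    # ElementTree has no parent pointers, so build a child -> parent map
    # to support backward navigation.
    parent = {child: p for p in root.iter() for child in p}

    candidates = root.findall("./article/year")            # pre-selection: navigate down
    selected = [y for y in candidates if y.text == year]   # selection: apply value filter
    titles = []
    for year_node in selected:                             # post-selection: up, then down
        article = parent[year_node]
        titles.extend(t.text for t in article.findall("title"))
    return titles


print(titles_from_year(ET.fromstring(BIB), "1990"))
```

The post-selection step is the backward navigation that a plain index of root-to-leaf paths does not support.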
ToXin
Factors Impacting Performance
• Data source specific
- Document size
- Number of XML nodes and values
- Path complexity (degree of nesting)
- Average value size
• Query specific
- Selectiveness of path constraint
- Size of query answer
- Number of elements selected by filter
Benchmark Parameters
Query Classification
Evaluation
ToXin: Summary
• Efficient native XML database
• All paths are indexed (not just from root)
• Path index linear in corpus size
• Shortcomings
- Order of nodes ignored
- Semantics of IDRefs ignored
What is missing?
IR/Relevance Ranking for XML
Why is this difficult?
IR XML Challenge 1: Term Statistics
• There is no document unit in XML
• How do we compute tf and idf?
• Global tf/idf over all text context is useless
• Indexing granularity
IR XML Challenge 2: Fragments
• IR systems don’t store content (only index)
• Need to go to document for displaying fragment
• Easier in DB framework
Relevance Ranking for XML
Will revisit next week
Querying XML
• Semistructured queries
• XPath
• XQuery
Types of (Semi)Structured Queries
• Location/position (“chapter no.3”)
• Simple attribute/value
- /play/title contains “hamlet”
• Path queries
- title contains “hamlet”
- /play//title contains “hamlet”
• Complex graphs
- Employees with two managers
• All of the above: mixed structure/content
• Subsumes: hyperlinks
XPath
• Declarative language for
- Addressing (used in XLink/XPointer and in XSLT)
- Pattern matching (used in XSLT and in XQuery)
• Location path
- a sequence of location steps separated by /
• Example: child::section[position()<6] / descendant::cite / attribute::href
Axes in XPath
ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
Location steps
• A single location step has the form: axis :: node-test [ predicate ]
• The axis selects a rough set of candidate nodes (e.g. the child nodes of the context node).
• The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.), or names (e.g. element name).
• The predicates (zero or more) cause a further, potentially more complex, filtration.
• Example: child::section[position()<6]
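The location path child::section[position()<6] / descendant::cite / attribute::href can be traced step by step in plain Python. The standard xml.etree.ElementTree module supports only a subset of XPath and has no position() comparison, so this sketch emulates each step by hand rather than evaluating the expression; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def first_sections_cite_hrefs(context, limit=6):
    """Emulate child::section[position()<limit]/descendant::cite/attribute::href."""
    hrefs = []
    sections = context.findall("section")      # axis child, node-test section
    for section in sections[:limit - 1]:       # predicate position() < limit
        for cite in section.iter("cite"):      # axis descendant, node-test cite
            if "href" in cite.attrib:          # axis attribute, node-test href
                hrefs.append(cite.get("href"))
    return hrefs
```

Each step takes the node set produced by the previous step as its context, which is the sense in which a location path is a sequence of location steps.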
XQuery
• SQL for XML
• Usage scenarios
- Human-readable documents
- Data-oriented documents
- Mixed documents (e.g., patient records)
• Relies on
- XPath
- XML Schema datatypes
• Turing complete
• XQuery is still a working draft: more than a hundred open issues as of 2002.11.10
XQuery
• The principal forms of XQuery expressions are:
- path expressions
- element constructors
- FLWR ("flower") expressions
- list expressions
- conditional expressions
- quantified expressions
- datatype expressions
• Evaluated with respect to a context
FLWR
FOR $p IN document("bib.xml")//publisher
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
at this stage, we have an ordered list of tuples of bindings: ($p,$b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
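The FOR / LET / WHERE / RETURN pipeline can be mimicked in Python to show how the tuples of bindings flow through the clauses. This is an emulation over an ElementTree document, not an XQuery engine. It returns publisher names rather than elements, and, following the query as written, produces one result per publisher element, so a name can repeat. The function name and the min_books parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def flwr_publishers(root, min_books=100):
    """Emulate the FLWR query: publishers with more than `min_books` books."""
    results = []
    for p in root.iter("publisher"):                        # FOR: bind $p in document order
        books = [                                           # LET: bind $b to matching books
            b for b in root.iter("book")
            if any(q.text == p.text for q in b.findall("publisher"))
        ]
        if len(books) > min_books:                          # WHERE: filter the ($p, $b) tuples
            results.append(p.text)                          # RETURN: build one value per tuple
    return results
```

The nested scan is quadratic in the number of books; a real engine would use an index or a grouping step instead.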
XQuery vs SQL
• Order matters!
- document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]
• XQuery is Turing complete, SQL is not.
XQuery Example
Møller and Schwartzbach
XQuery Standard on Ranking (2.3.1)
Document order defines a total ordering among all the nodes seen by the language processor. Within a given document, the document node is the first node, followed by element nodes, text nodes, comment nodes, and processing instruction nodes in the order of their representation in the XML form of the document (after expansion of entities). Element nodes occur before their children. The namespace nodes of an element immediately follow the element node, in implementation-defined order. The attribute nodes of an element immediately follow its namespace nodes, and are also in implementation-defined order.
The relative order of nodes in distinct documents is implementation-defined but stable within a given query or transformation. In other words, given two distinct documents A and B, if a node in document A is before a node in document B, then every node in document A is before every node in document B. The relative order among free-floating nodes (those not in a document) is implementation-defined.
Next Week (12/3)
• XML indexing and search II
• Metadata indexing and search
- Dublin Core, RDF, DAML+OIL