action 2: mine the web

Enhanced Content Delivery

Action 2: Mine the Web

Industrial Day

Rome, 10 June 2004

ECD - Industrial Day, Rome, 10 June 2004

Action 2 - Partners

ICAR-CNR, Cosenza

KDD & HPC Labs ISTI-CNR, Pisa

Dipartimento di Informatica, Università di Pisa


Action 2 – Mine the Web

The project: four Work Packages

Action Coordinator: Dott. Fosca Giannotti, ISTI-CNR

Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)
WP Coordinator: Dott. Salvatore Ruggieri, Dip. Informatica

Work Package 2.2. Indexing and compression (UNIPI)
WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica

Work Package 2.3. Managing Terabytes (ISTI, ICAR)
WP Coordinator: Dott. Raffaele Perego, ISTI-CNR

Work Package 2.4. Participatory Search Services (UNIPI)
WP Coordinator: Prof. Maria Simi, Dip. Informatica



The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way complementary to Action 1

The focus is on Delivering Enhanced Web Contents to (Communities of) Users:
- Exploiting Web Mining to extract knowledge/models that can be used to enhance the efficacy and efficiency of the various phases of the information search process
- Designing, validating and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users


Motivations

On-line data grows rapidly:
- 50+M new pages/day (source: IBM)
- 100+k news, articles/day (source: IBM)
- Databases, digital libraries, etc.

Internet use tracking produces additional interesting data:
- Server logs, WSE logs, network traffic logs

Goldman Sachs estimates (2002): "between 80 and 90 percent of information on the Internet and corporate networks is unstructured"


Motivations

The limits of the current means of access to web contents are becoming clear:
- Low precision and quality, difficulty of matching users' subjective relevance
  - over-abundance of low-quality web material
- Low coverage and freshness
  - much relevant information is in the hidden web
  - ranking mechanisms penalize important pages that enter the scene
- Difficulties in
  - managing size, complexity, heterogeneity
  - identifying patterns and trends within huge amounts of unstructured contents

Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge


Web Mining

User-Centric View (Client-Side)
- discovery of documents on a subject
- discovery of semantically related documents or document segments
- extraction of relevant knowledge about a subject from multiple sources

Web Mining: exploiting Data Mining techniques with data coming from the Web

Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories

Goal: assist users or site owners in finding something useful/interesting/relevant

Owner-Centric View (Server-Side)
- increasing contact / conversion efficiency (Web marketing)
- targeted promotion of goods, services, products, ads
- measuring effectiveness of site content / structure
- providing dynamic personalized services or content


Web Mining Taxonomy

Web Mining
- Web Usage Mining
- Web Content Mining
- Web Structure Mining

Example of Web usage data (server access log):

131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224
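The access-log lines above appear to follow the Common Log Format (host, identity, user, timestamp, request, status, size). As a minimal sketch under that assumption, this is one way such lines might be turned into structured records before any usage mining. The names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the project.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident authuser [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" size means that no body was sent (e.g. 304 Not Modified).
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('131.114.21.41 - - [27/May/2004:19:24:09 +0200] '
          '"GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072')
record = parse_log_line(sample)
```

Records of this kind (client, time, resource, outcome) are the raw material from which user sessions can later be reconstructed.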


Web Mining Applications

Web Usage Mining
- discovering customer preference and behavior
- Web personalization / collaborative filtering
- adaptive Web sites / improving Web site organization
- e-business intelligence, etc.

Web Content Mining
- information filtering / knowledge extraction
- Web document categorization
- discovery of ontologies on the Web, etc.

Web Structure Mining
- Finding "Quality" or "authoritative" sites based on linkage and citations
  - IBM CLEVER project
  - Google
  - Etc.


Some related projects

- WebFountain - IBM
- WebBase - Stanford DBGroup


WebFountain (IBM)

Infrastructure for Advanced Text Analytics: finds patterns, trends and relationships in text

Sources:
- World-Wide Web, News
- Forums, Weblogs, etc.
- Newspapers, Magazines, etc.
- Customer Electronic Text

Application Examples:
• Marketing
• Intelligence
• Research


WebFountain: an infrastructure for Advanced Text Analytics applications

[Architecture diagram. Data sources: Internet, Intranets, News Feeds, Customer DBs, 3rd Party DBs. Components: Crawler, Structured Data Gatherer, Data Store, Information Miners, Index(es), Communications Infrastructure, Cluster Management System. Access: Application Server, Customers.]

PROJECT WF INFRASTRUCTURE

½ Petabyte               Cluster capacity
2,000,000,000            Number of pages in store
25,000,000               Number of pages crawled per day
10,000                   Number of pages mined per second
3674                     Number of 73GB hard drives
1231                     Number of CPUs
250                      Number of scientists and researchers who have contributed to WebFountain technology
100                      Patents pending
75                       Patents issued
70                       Megabytes/sec traffic coming in from internet
5 minutes, 22 seconds    Time to complete query
5                        Number of countries contributing to technology


WebFountain: Reputation Tracking


WebBase - Stanford DBGroup


WebBase Challenges

- Scalability: crawling, archive distribution, index construction, storage
- Consistency: freshness, versions
- Dissemination
- Archiving: "units", coordination
- IP Management: copy access, link access, access control
- Hidden Web
- Topic-Specific Collection Building


Action 2 – Mine the Web: application scenario

So far, almost no approach analyzes how a given group of users accesses the Web with the aim of exploiting usage information to provide the users of that group with enhanced access to web resources

We think that it is possible to learn, from the usage data of a group of web users, new models and patterns that, in combination with document content and structure, may yield enhanced content access and delivery: better search services, better categorization and document classification services, better question answering services



Ambitious objective: exploit the combination of Web data about

USAGE, STRUCTURE, CONTENT

originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view

Developing solutions that are:
- Innovative w.r.t. the state of the art
- Appropriate for the Web domain


Virtual Organizations

[Figure: a Virtual Community connected to the Internet]


Tracking Virtual Organizations

Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information

Network traffic data provide detailed information about:
- Usage: preferred sites, user sessions
- Content: accessed documents
- Structure:
  - from client sessions we can build the usage Web subgraph
  - by parsing the documents retrieved we can build the corresponding link graph
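As a rough illustration of the two graphs mentioned above, the following sketch builds a traffic graph from user sessions and a link graph from retrieved HTML. It assumes that sessions are already available as ordered lists of URLs and that documents are a mapping from URL to HTML text. All names (`traffic_graph`, `link_graph`, the example URLs) are hypothetical and not taken from the project.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


def traffic_graph(sessions):
    """Count page-to-page transitions observed in user sessions."""
    edges = Counter()
    for session in sessions:
        for source, target in zip(session, session[1:]):
            edges[(source, target)] += 1
    return edges


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags in one document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def link_graph(documents):
    """Build the set of hyperlink edges from a {url: html} mapping."""
    edges = set()
    for url, html in documents.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        edges.update((url, target) for target in parser.links)
    return edges


# Hypothetical data: two sessions and one retrieved document.
sessions = [
    ["http://a.example/", "http://a.example/news", "http://b.example/"],
    ["http://a.example/", "http://a.example/news"],
]
documents = {
    "http://a.example/": '<a href="/news">News</a> <a href="http://c.example/">C</a>',
}
usage_edges = traffic_graph(sessions)
link_edges = link_graph(documents)
```

In this toy data the edge from `http://a.example/` to `http://c.example/` exists in the link graph but is never followed, while the transition to `http://b.example/` is observed in the traffic without a corresponding parsed link. Differences of this kind are what a combined link and traffic graph makes visible.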


Tracking Virtual Organizations

[Figure: the link graph, the traffic graph, and the combined link and traffic graph of the Virtual Community]


We need an infrastructure: the Web Object Store (WOS)

A Web Data Management System optimized to efficiently handle content, usage, and structure web data

Purpose: enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view

Manage large collections of
- Web pages
- Preprocessed usage data
- Structure data

collected within our virtual community


WOS and related activities

WOS layers:
- Clustering/Pattern/Classification Web Mining algorithms
- Efficient and scalable access methods:
  • IXE b-trees, full-text indexes
  • search in compressed data
- Data cleaning, preprocessing, filtering
- Population:
  • raw traffic data of our community
  • IXE Crawler
  • Participatory search
- Efficient and scalable storage:
  • IXE persistent objects
  • compression
  • distributed architecture

Persistent store of objects:
- Web data management system for web content, structure and usage data
- Management of data at many abstraction levels
- Fast development of new applications
  • Easy C++ annotation of new persistent objects
  • Read and write data in tables

Related activities:
- Clustering Emails
- Caching of Documents and of Query results
- Efficient and scalable pattern mining and clustering algorithms
- Enhanced compression methods
- Clustering/categorizing query results snippets
- Clustering XML documents
- Etc.


WOS applications

Some innovative applications are currently pursued within our project:
- Characterization, on the basis of usage only or usage + contents + structure, of new important emerging sites, or irrelevant sites (e.g., advertising sites)
  - crucial to steer the crawler of the community web repository towards fresh, relevant documents while avoiding unimportant documents
- Page ranking based also on usage information, for achieving a more accurate and dynamic measurement of document relevance
- Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
- Caching and clustering of web search results
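The slides do not say how usage enters the ranking. Purely as an illustration, and not as the project's method, the sketch below shows one possible way to do it: a PageRank-style power iteration in which each edge is weighted by one unit if a hyperlink exists plus a multiple of the observed traffic on it. The function name, the `usage_weight` parameter and the weighting scheme are all assumptions.

```python
def usage_weighted_pagerank(link_edges, traffic_counts, damping=0.85,
                            usage_weight=1.0, iterations=50):
    """PageRank-style scores where observed traffic increases edge weights (illustrative)."""
    nodes = sorted({n for edge in link_edges for n in edge}
                   | {n for edge in traffic_counts for n in edge})
    if not nodes:
        return {}

    # Edge weight: 1 for a hyperlink, plus usage_weight for every observed click.
    weights = {}
    for edge in link_edges:
        weights[edge] = weights.get(edge, 0.0) + 1.0
    for edge, count in traffic_counts.items():
        weights[edge] = weights.get(edge, 0.0) + usage_weight * count

    out_total = {n: 0.0 for n in nodes}
    for (source, _), weight in weights.items():
        out_total[source] += weight

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0.0)
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for (source, target), weight in weights.items():
            new_rank[target] += damping * rank[source] * weight / out_total[source]
        rank = new_rank
    return rank


# Hypothetical example: page "a" links to "b" and "c", but users only follow a -> b.
scores = usage_weighted_pagerank({("a", "b"), ("a", "c")}, {("a", "b"): 5})
```

With `usage_weight=0` the computation falls back to a purely link-based ranking, so the parameter controls how strongly the community's behaviour shifts the scores.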


WOS population: usage data (WP 2.1)

- Many-to-many interactions
- Inter-site user sessions
- Massive data
  - Millions of HTTP requests/day
  - ~1 GB/day raw data

We collected long periods of proxy-level IP traffic originating from the SERRA network (domain unipi.it): the whole University of Pisa


WOS population: content data (WP 2.4)

Methods to gather contents to populate the Web Object Store:
- IXE Crawler
- Participatory Search System (main activity this year)
- Hidden Web Search


WOS population: content data (WP 2.4)

[Figure: IXE crawler loop. Steps: init (from the initial urls), get next url, get page (from the Internet), extract urls (from the retrieved web pages)]
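The crawler loop in the figure (init, get next url, get page, extract urls) can be sketched as follows. This is a single-threaded toy, not the IXE crawler: page retrieval is passed in as a function so that the example runs offline, and `FAKE_WEB` is a hypothetical stand-in for the Internet.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefParser(HTMLParser):
    """Collect the raw href values of the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def extract_urls(base_url, html):
    """Return the absolute URLs linked from one page."""
    parser = HrefParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(initial_urls, get_page, max_pages=100):
    """Breadth-first crawl; get_page(url) returns HTML text or None on failure."""
    frontier = deque(initial_urls)            # init
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()              # get next url
        html = get_page(url)                  # get page
        if html is None:
            continue
        pages[url] = html
        for link in extract_urls(url, html):  # extract urls
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page web used instead of real network access.
FAKE_WEB = {
    "http://site.example/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://site.example/a": '<a href="/">home</a>',
    "http://site.example/b": '<a href="/missing">gone</a>',
}
pages = crawl(["http://site.example/"], FAKE_WEB.get)
```

A production crawler would add what this sketch leaves out: politeness delays, robots.txt handling, and the asynchronous I/O that gives a crawler its throughput.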


IXE Crawler

Parallel/distributed crawler

High performance through:
- asynchronous I/O (500 connections/thread)
- asynchronous DNS resolution
- keep-alive connections
- multi-threads
- URL compression

9 Mb/sec transfer rate (7 times nutch.org crawler)


Participatory search: the idea

Participatory search:
- each participant builds an index of the local contents and sends it to a central server
- the central server implements a community search service, collecting and merging the participants' indexes

A model that fits community needs for dedicated search services

A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa)
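The idea can be sketched with toy inverted indexes: each participant indexes its own documents, and the central server only merges the indexes it receives and answers queries. This is a minimal illustration under simple assumptions (whitespace tokenization, AND queries, in-memory sets). The function names and sample data are hypothetical, not those of the actual system.

```python
from collections import defaultdict


def build_local_index(participant, documents):
    """Participant side: map each term to the (participant, doc_id) pairs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add((participant, doc_id))
    return index


def merge_indexes(local_indexes):
    """Server side: merge the indexes sent by the participants."""
    merged = defaultdict(set)
    for index in local_indexes:
        for term, postings in index.items():
            merged[term] |= postings
    return merged


def search(merged, query):
    """Return the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(merged.get(terms[0], set()))
    for term in terms[1:]:
        results &= merged.get(term, set())
    return results


# Two hypothetical participants publishing their local contents.
alice = build_local_index("alice", {"d1": "web mining slides", "d2": "compression notes"})
bob = build_local_index("bob", {"d1": "web usage mining"})
central = merge_indexes([alice, bob])
hits = search(central, "web mining")
```

The server never crawls the participants: it sees only what each of them chose to index and send, which is where the control over publishing and freshness comes from.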


Participatory Search

[Figure: Centralized, Participatory and Distributed architectures compared, showing where documents, search index and search results are located. Legend: C = Crawler, I = Indexer, S = Search Engine]


Participatory Search: benefits

Participants are in charge of selecting
- what to index and to publish
- when to publish (no need of coordination with an external crawler)

Control on index update and freshness

Publishing of Hidden Web content


Storage and access methods: compression (WP 2.2)

Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee

[Figure: A compresses the string s into c; the Booster, which uses A, compresses s into c']

- The better A is, the better Aboost is
- The more compressible s is, the better Aboost is

Qualitatively, we show that
- c' is shorter than c, if s is compressible
- Time(Aboost) = Time(A), i.e. no slowdown
- A is used as a black-box

Key Components: Burrows-Wheeler Transform, Suffix Tree, and a Greedy processing of them
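The sketch below illustrates only one of the key components named above, the Burrows-Wheeler Transform. It is not the boosting algorithm itself. It is a naive, quadratic-space version that assumes a terminator character (here `$`) which does not occur in the input. The function names are illustrative.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("the terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Invert the transform by rebuilding the sorted rotation table column by column."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform groups together characters that occur in similar contexts (`banana` becomes `annb$aa`), which is the property that makes the output easier to compress.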


Storage and access methods (WP 2.1 and 2.2)

Repository of URLs
- Compressed
- Prefix and Suffix search within URLs
- Search by hostname, path, file-ext, …

select count(*) from … where url LIKE 'http://%.it/%.asp'

Up to two orders of magnitude faster than using sequential scan and B-tree

Space occupancy << B-tree
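To make the prefix and suffix queries concrete, here is a minimal sketch that answers them with binary search over two sorted arrays, one of URLs and one of reversed URLs. It is uncompressed and is not the data structure used in the project. `UrlRepository` and the sample URLs are hypothetical.

```python
import bisect
from urllib.parse import urlsplit


class UrlRepository:
    """Toy URL store supporting prefix and suffix search via sorted arrays."""

    def __init__(self, urls):
        self.forward = sorted(urls)
        # Reversed strings turn a suffix query into a prefix query.
        self.backward = sorted(url[::-1] for url in urls)

    @staticmethod
    def _range(sorted_keys, prefix):
        low = bisect.bisect_left(sorted_keys, prefix)
        # The highest code point is used as an upper bound for keys sharing the prefix.
        high = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
        return sorted_keys[low:high]

    def with_prefix(self, prefix):
        return self._range(self.forward, prefix)

    def with_suffix(self, suffix):
        return [key[::-1] for key in self._range(self.backward, suffix[::-1])]


repo = UrlRepository([
    "http://www.unipi.it/index.asp",
    "http://www.cnr.it/home.html",
    "http://example.com/page.asp",
])

# Rough equivalent of: url LIKE 'http://%.it/%.asp'
# (suffix search on the file extension, then a filter on the hostname).
matches = [url for url in repo.with_suffix(".asp")
           if urlsplit(url).hostname.endswith(".it")]
```

Each lookup costs a logarithmic number of string comparisons instead of a scan of the whole repository. This is the kind of gap the slide's comparison with sequential scan refers to, although a compressed index achieves it in far less space than two plain arrays.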


Storage and access methods: index compression (WP 2.3)

Assigning DocIDs in a clever way could improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps.

Clustering property: within each posting list there are dense zones (i.e. a lot of small DGaps).

Our problem consists of enhancing the Clustering Property of posting lists.
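A small worked example may help to see why small DGaps matter. The sketch below computes the gaps of a posting list and encodes them with a common variable-byte scheme (7 payload bits per byte, high bit set on the last byte of each number). The two posting lists are made-up data that stand for the same four documents under two different DocID assignments.

```python
def d_gaps(doc_ids):
    """Turn a sorted posting list into the gaps between consecutive DocIDs."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def variable_byte_encode(numbers):
    """Encode non-negative integers with 7 bits per byte; the high bit ends a number."""
    out = bytearray()
    for number in numbers:
        chunk = [number & 0x7F]
        number >>= 7
        while number:
            chunk.append(number & 0x7F)
            number >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80
        out.extend(chunk)
    return bytes(out)


def variable_byte_decode(data):
    """Inverse of variable_byte_encode."""
    numbers = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(current)
            current = 0
    return numbers


# Hypothetical posting list of one term under two DocID assignments.
scattered = [5, 900, 40000, 2000000]
clustered = [5, 6, 7, 9]
scattered_size = len(variable_byte_encode(d_gaps(scattered)))
clustered_size = len(variable_byte_encode(d_gaps(clustered)))
```

The scattered assignment needs 9 bytes and the clustered one 4, because every gap below 128 fits in a single byte.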


Compression Enhancement


Content delivery (WP 2.1, 2.2 and 2.3)

Web Caching
- Mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)

Recommendation system (On line/Off line)
- Mining of web sessions aimed at profiling users and recommending them related pages (WP 2.1, 2.3)

Transactional Clustering
- Clustering specialized on transactional data aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)


Content delivery (WP 2.3)

SUGGEST: a recommendation system made up of two distinct modules
- Offline: performs model extraction through a clustering algorithm which partitions the Usage Graph
- Online: performs user classification and suggestion generation

The WOS remarkably shortened implementation time (< 500 C++ lines)

We used three WOS objects to produce a persistent clustering structure

[Figure labels: Citation, PageView, Sessions, Cluster]
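The two-module split can be illustrated with a deliberately simple stand-in. Offline, weak edges of the usage graph are dropped and the remaining connected components are taken as clusters. Online, the current session is matched to the cluster it overlaps most, and the unvisited pages of that cluster are suggested. The clustering algorithm actually used by SUGGEST is not described in these slides, so the threshold-and-components choice, the function names and the sample data are all assumptions.

```python
from collections import defaultdict


def offline_clusters(usage_edges, min_weight=2):
    """Partition pages into connected components of the frequently used edges."""
    neighbours = defaultdict(set)
    for (a, b), weight in usage_edges.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters = []
    seen = set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        seen |= component
        clusters.append(component)
    return clusters


def online_suggest(clusters, session):
    """Suggest the unvisited pages of the cluster that best matches the session."""
    visited = set(session)
    best = max(clusters, key=lambda cluster: len(cluster & visited), default=set())
    if not best & visited:
        return []
    return sorted(best - visited)


# Hypothetical usage graph: (source page, target page) -> number of transitions.
usage_edges = {
    ("/home", "/courses"): 5,
    ("/courses", "/exams"): 3,
    ("/home", "/sport"): 1,
    ("/sport", "/tickets"): 4,
}
clusters = offline_clusters(usage_edges)
suggestions = online_suggest(clusters, ["/home"])
```

The expensive step (clustering) runs offline on accumulated usage data, while the online step is a cheap lookup that can be done per request.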



Goal: retrieve the pages which match the user needs.

This is a very difficult task, given that:
- the Web size is increasing, and so is the number of answers
- the Web coverage is a problem for a single search engine
- Web pages are heterogeneous
- user needs are subjective and time-varying
- the "list of keywords" paradigm for a user query may be ambiguous

SnakeT: clusters the web-snippets returned by many search engine(s) into hierarchically labeled folders, which are created on-the-fly to capture the various meanings of the answers returned for a user query


SnakeT: An example of use

Look at the DEMO



Clustering of E-mails (manco) XML documents (chiara) ??


Ongoing and future activities

Work in progress
- Pursuing our goal of exploiting USAGE, STRUCTURE, CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web
- Implementation of additional WOS layers
  - Compression booster, XML clustering

Future work (medium-long term)
- WOS, final version
- Community-oriented ranking
- Content (news, xml, ..) clustering
- Cooperation with Nutch.org (Doug Cutting in Pisa next October)
- etc.


Deployment scenarios

Concerning the role of the WOS and of the ECD applications, three (non-exclusive) possible deployment scenarios could be devised:
- The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
- The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to the WebFountain IBM project
- The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase


Demo Session

Three demos here:
- WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
- SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
- ANTIX: Participatory Search System (Andrea Esuli)

Some other activities are described in the Posters


More information

Interested people can find these slides, more information, documents and the full list of publications at the address:

http://ecd.isti.cnr.it
