Action 2: Mine the Web

Page 1: Action 2: Mine the Web

Enhanced Content Delivery

Action 2: Mine the Web

Industrial Day

Rome, 10 June 2004

Page 2: Action 2: Mine the Web

ECD - Industrial Day, Rome, 10 June 2004

Action 2 - Partners

ICAR-CNR, Cosenza

KDD & HPC Labs ISTI-CNR, Pisa

Dipartimento di Informatica, Università di Pisa

Page 3: Action 2: Mine the Web

Action 2 – Mine the Web

The project: four Work Packages (Action Coordinator: Dott. Fosca Giannotti, ISTI-CNR)

Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)
WP Coordinator: Dott. Salvatore Ruggieri, Dip. Informatica

Work Package 2.2. Indexing and compression (UNIPI)
WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica

Work Package 2.3. Managing Terabytes (ISTI, ICAR)
WP Coordinator: Dott. Raffaele Perego, ISTI-CNR

Work Package 2.4. Participatory Search Services (UNIPI)
WP Coordinator: Prof. Maria Simi, Dip. Informatica

Page 4: Action 2: Mine the Web

Action 2 – Mine the Web

The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way that is complementary to Action 1.

The focus is on delivering enhanced Web contents to (communities of) users:
- Exploiting Web Mining to extract knowledge/models that can be used to enhance the efficacy and efficiency of the various phases of the information search process
- Designing, validating and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users

Page 5: Action 2: Mine the Web

Motivations

Online data grows rapidly:
- 50+ million new pages/day (source: IBM)
- 100+ thousand news items and articles/day (source: IBM)
- Databases, digital libraries, etc.

Tracking Internet use produces additional interesting data: server logs, WSE logs, network traffic logs

Goldman Sachs estimates (2002): "between 80 and 90 percent of information on the Internet and corporate networks is unstructured"

Page 6: Action 2: Mine the Web

Motivations

The limits of the current means of access to Web contents are becoming clear:
- Low precision and quality, difficulty of matching users' subjective relevance
- Over-abundance of low-quality Web material
- Low coverage and freshness
  - much relevant information is in the hidden Web
  - ranking mechanisms penalize important pages that have only just entered the scene
- Difficulties in
  - managing size, complexity, heterogeneity
  - identifying patterns and trends within huge amounts of unstructured contents

Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge

Page 7: Action 2: Mine the Web

Web Mining

User-Centric View (Client-Side)
- discovery of documents on a subject
- discovery of semantically related documents or document segments
- extraction of relevant knowledge about a subject from multiple sources

Web Mining: exploiting Data Mining techniques on data coming from the Web

Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories

Goal: assist users or site owners in finding something useful/interesting/relevant

Owner-Centric View (Server-Side)
- increasing contact / conversion efficiency (Web marketing)
- targeted promotion of goods, services, products, ads
- measuring effectiveness of site content / structure
- providing dynamic personalized services or content

Page 8: Action 2: Mine the Web

Web Mining Taxonomy

Web Mining
- Web Usage Mining
- Web Content Mining
- Web Structure Mining

Sample Web server access log:

131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224
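Access-log lines in this format are the raw material of Web Usage Mining. As a minimal sketch (not the project's own preprocessing code), the following Python snippet parses one such line into its fields, assuming the standard Common Log Format layout; the names `LOG_PATTERN` and `parse_log_line` are illustrative only.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Parse one access-log line into a dict, or return None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # "-" means no body was sent (e.g. a 304 Not Modified response)
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = (
    '131.114.21.41 - - [27/May/2004:19:24:09 +0200] '
    '"GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072'
)
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"], record["size"])
```

Records of this kind can then be grouped by host and time to reconstruct user sessions.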

Page 9: Action 2: Mine the Web

Web Mining Applications

Web Usage Mining
- discovering customer preference and behavior
- Web personalization / collaborative filtering
- adaptive Web sites / improving Web site organization
- e-business intelligence, etc.

Web Content Mining
- information filtering / knowledge extraction
- Web document categorization
- discovery of ontologies on the Web, etc.

Web Structure Mining
- finding "quality" or "authoritative" sites based on linkage and citations
  - IBM CLEVER project
  - Google
  - etc.

Page 10: Action 2: Mine the Web

Some related projects

- WebFountain - IBM
- WebBase - Stanford DBGroup

Page 11: Action 2: Mine the Web

WebFountain (IBM)

Infrastructure for Advanced Text Analytics: finds patterns, trends and relationships in text

Sources:
- World-Wide Web, news
- Forums, weblogs, etc.
- Newspapers, magazines, etc.
- Customer electronic text

Application examples:
- Marketing
- Intelligence
- Research

Page 12: Action 2: Mine the Web

WebFountain: an infrastructure for Advanced Text Analytics applications

[Figure: WebFountain architecture. Labels: Customer DBs, 3rd Party DBs, Application Server, Customers, Internet, Intranets, News Feeds, Crawler, Structured Data Gatherer, Data Store, Information Miners, Communications Infrastructure, Index(es), Cluster Management System]

PROJECT WF INFRASTRUCTURE

½ petabyte             Cluster capacity
2,000,000,000          Number of pages in store
25,000,000             Number of pages crawled per day
10,000                 Number of pages mined per second
3674                   Number of 73 GB hard drives
1231                   Number of CPUs
250                    Number of scientists and researchers who have contributed to WebFountain technology
100                    Patents pending
75                     Patents issued
70                     Megabytes/sec traffic coming in from Internet
5 minutes, 22 seconds  Time to complete query
5                      Number of countries contributing to technology

Page 13: Action 2: Mine the Web

WebFountain: Reputation Tracking

Page 14: Action 2: Mine the Web

WebBase (Stanford DBGroup)

Page 15: Action 2: Mine the Web

WebBase Challenges

Scalability
- crawling
- archive distribution
- index construction
- storage

Consistency
- freshness
- versions

Dissemination

Archiving
- "units"
- coordination

IP Management
- copy access
- link access
- access control

Hidden Web

Topic-Specific Collection Building

Page 16: Action 2: Mine the Web

Action 2 – Mine the Web: application scenario

So far, hardly any approach analyzes how a given group of users accesses the Web with the aim of exploiting usage information to give the users of that group enhanced access to Web resources

We think that it is possible to learn, from the usage data of a group of Web users, new models and patterns that, in combination with document content and structure, may yield enhanced content access and delivery:
- better search services
- better categorization and document classification services
- better question answering services

Page 17: Action 2: Mine the Web

Action 2 – Mine the Web

Ambitious objective: exploit the combination of Web data about

USAGE, STRUCTURE, CONTENT

originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view

Developing solutions that are:
- innovative w.r.t. the state of the art
- appropriate for the Web domain

Page 18: Action 2: Mine the Web

Virtual Organizations

[Figure labels: Virtual Community, Internet]

Page 19: Action 2: Mine the Web

Tracking Virtual Organizations

Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information

Network traffic data provide detailed information about:
- Usage: preferred sites, user sessions
- Content: accessed documents
- Structure:
  - from client sessions we can build the usage Web subgraph
  - by parsing the retrieved documents we can build the corresponding link graph
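To make the idea of a usage subgraph concrete, here is a minimal Python sketch that derives weighted host-to-host edges from client sessions. It is an illustration under stated assumptions, not the project's method: requests are assumed to be `(client, timestamp_in_seconds, url)` tuples, a session is assumed to break after 30 minutes of inactivity, and the function name `build_traffic_graph` is hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def build_traffic_graph(requests, session_gap=1800):
    """Build a weighted usage graph from (client, timestamp, url) requests.

    Consecutive requests by the same client at most `session_gap` seconds
    apart are treated as part of one session (an assumption of this sketch).
    Each consecutive pair of distinct hosts inside a session adds one unit
    of weight to the edge (from_host, to_host).
    """
    by_client = defaultdict(list)
    for client, timestamp, url in requests:
        by_client[client].append((timestamp, urlsplit(url).hostname))

    edges = Counter()
    for visits in by_client.values():
        visits.sort(key=lambda visit: visit[0])  # order by time only
        for (t_prev, h_prev), (t_next, h_next) in zip(visits, visits[1:]):
            if t_next - t_prev <= session_gap and h_prev != h_next:
                edges[(h_prev, h_next)] += 1
    return edges

requests = [
    ("10.0.0.1", 0, "http://www.unipi.it/index.html"),
    ("10.0.0.1", 20, "http://www.cnr.it/"),
    ("10.0.0.1", 5000, "http://www.unipi.it/"),   # new session: gap too long
    ("10.0.0.2", 10, "http://www.unipi.it/"),
    ("10.0.0.2", 40, "http://www.cnr.it/news"),
]
graph = build_traffic_graph(requests)
print(dict(graph))
```

The link graph would be built analogously, with edges taken from the hyperlinks found in the retrieved documents rather than from the order of requests.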

Page 20: Action 2: Mine the Web

Tracking Virtual Organizations

[Figure: the Virtual Community's link graph, traffic graph, and combined link and traffic graph]

Page 21: Action 2: Mine the Web

We need an infrastructure: the Web Object Store (WOS)

A Web Data Management System optimized to efficiently handle content, usage, and structure Web data

Purpose: enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view

Manage large collections of
- Web pages
- preprocessed usage data
- structure data
collected within our virtual community

Page 22: Action 2: Mine the Web

WOS and related activities

Persistent store of objects: a Web data management system for Web content, structure and usage data
- Management of data at many abstraction levels
- Fast development of new applications
- Easy C++ annotation of new persistent objects
- Read and write data in tables

Clustering/Pattern/Classification Web Mining algorithms

Efficient and scalable access methods:
• IXE b-trees, full-text indexes
• search in compressed data

Data cleaning, preprocessing, filtering

Population:
• raw traffic data of our community
• IXE Crawler
• Participatory search

Efficient and scalable storage:
• IXE persistent objects
• compression
• distributed architecture

Related activities:
- Clustering emails
- Caching of documents and of query results
- Efficient and scalable pattern mining and clustering algorithms
- Enhanced compression methods
- Clustering/categorizing query result snippets
- Clustering XML documents
- Etc.

Page 23: Action 2: Mine the Web

WOS applications

Some innovative applications are currently being pursued within our project:
- Characterization, on the basis of usage only or of usage + contents + structure, of new important emerging sites, or of irrelevant sites (e.g., advertising sites); this is crucial to steer the crawler of the community Web repository towards fresh, relevant documents while avoiding unimportant ones
- Page ranking based also on usage information, to achieve a more accurate and dynamic measurement of document relevance
- Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
- Caching and clustering of Web search results

Page 24: Action 2: Mine the Web

WOS population: usage data (WP 2.1)

- Many-to-many interactions
- Inter-site user sessions
- Massive data
  - millions of HTTP requests/day
  - ~1 GB/day of raw data

We collected long periods of proxy-level IP traffic originating from the SERRA network (domain unipi.it), which covers the whole University of Pisa

Page 25: Action 2: Mine the Web

WOS population: content data (WP 2.4)

Methods to gather contents to populate the Web Object Store:
- IXE Crawler
- Participatory Search System (main activity this year)
- Hidden Web Search

Page 26: Action 2: Mine the Web

WOS population: content data (WP 2.4)

IXE crawler

[Flow diagram. Steps: init, get next url, get page, extract urls. Data: initial urls, web pages, Internet]
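The basic crawl loop named in the diagram (init, get next url, get page, extract urls) can be sketched in a few lines of Python. This is a generic breadth-first illustration, not the IXE crawler: the in-memory `FAKE_WEB` dictionary, the `example` hosts and all function names are hypothetical stand-ins so that the sketch runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": "no links here",
}

HREF = re.compile(r'href="([^"]+)"')

def get_page(url):
    """Stand-in for an HTTP fetch; returns None for unknown URLs."""
    return FAKE_WEB.get(url)

def extract_urls(page):
    """Return the targets of all href attributes found in the page."""
    return HREF.findall(page)

def crawl(initial_urls, max_pages=100):
    frontier = deque(initial_urls)        # init: seed the frontier
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()          # get next url
        page = get_page(url)              # get page
        if page is None:
            continue
        pages[url] = page                 # store the web page
        for link in extract_urls(page):   # extract urls
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl(["http://a.example/"])
print(sorted(pages))
```

The `seen` set keeps the loop from revisiting pages that link back to each other, as `a.example` and `b.example` do here.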

Page 27: Action 2: Mine the Web

IXE Crawler

Parallel/distributed crawler

High performance through:
- asynchronous I/O (500 connections/thread)
- asynchronous DNS resolution
- keep-alive connections
- multi-threading
- URL compression

9 Mb/sec transfer rate (7 times that of the nutch.org crawler)

Page 28: Action 2: Mine the Web

Participatory search: the idea

Participatory search:
- each participant builds an index of its local contents and sends it to a central server
- the central server implements a community search service by collecting and merging the participants' indexes

A model that fits community needs for dedicated search services

A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa)
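The two roles described above can be illustrated with a toy Python sketch: participants build small inverted indexes locally, and a central server merges them and answers queries. Everything here is an assumption made for illustration (whitespace tokenization, boolean AND queries, the function names, the `site_a`/`site_b` participants); the actual index format and protocol are not described in these slides.

```python
from collections import defaultdict

def build_local_index(documents):
    """Participant side: map each term to the set of local document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(participant_indexes):
    """Server side: merge per-participant indexes into one community index.

    Postings are stored as (participant, doc_id) pairs so that a hit can be
    traced back to the participant that published it.
    """
    merged = defaultdict(set)
    for participant, index in participant_indexes.items():
        for term, doc_ids in index.items():
            merged[term].update((participant, doc_id) for doc_id in doc_ids)
    return merged

def search(merged, query):
    """Return the postings that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(merged.get(terms[0], set()))
    for term in terms[1:]:
        results &= merged.get(term, set())
    return results

indexes = {
    "site_a": build_local_index({"a1": "web mining tutorial", "a2": "crawler design"}),
    "site_b": build_local_index({"b1": "mining usage logs", "b2": "web crawler"}),
}
community = merge_indexes(indexes)
print(sorted(search(community, "web crawler")))
```

Only the indexes travel to the server; the documents themselves stay with the participants, which is what lets each participant decide what to publish and when.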

Page 29: Action 2: Mine the Web

Participatory Search

[Figure: the Centralized, Participatory and Distributed search models compared. Legend: C – Crawler, I – Indexer, S – Search Engine. Other labels: Search Index, Search results, Documents]

Page 30: Action 2: Mine the Web

Participatory Search: benefits

Participants are in charge of selecting
- what to index and to publish
- when to publish (no need for coordination with an external crawler)

Control over index update and freshness

Publishing of Hidden Web content

Page 31: Action 2: Mine the Web

Storage and access methods: compression (WP 2.2)

Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee

[Figure: compressor A turns the input s into the output c; the Booster yields Aboost, which turns s into c']

Qualitatively, we show that
- c' is shorter than c, if s is compressible
- Time(Aboost) = Time(A), i.e. no slowdown
- A is used as a black box

The better A is, the better Aboost is

The more compressible s is, the better Aboost is

Key components: Burrows-Wheeler Transform, Suffix Tree, and a greedy processing of them
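Of the key components listed, the Burrows-Wheeler Transform is the easiest to show in isolation. The Python sketch below is a naive, quadratic-time version for illustration only; it is not the boosting technique itself, which also relies on the suffix tree and the greedy processing step. The `$` sentinel and the function names are choices made for this example.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform via sorted rotations.

    The sentinel must not occur in the text and must sort before every
    other character (true for "$" with lowercase ASCII input).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, sentinel="$"):
    """Invert the transform by repeatedly prepending a column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

print(bwt("banana"))
print(inverse_bwt(bwt("banana")))
```

The transform tends to group together characters that share a following context, which is what makes the output easier to compress than the input.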

Page 32: Action 2: Mine the Web

Storage and access methods (WP 2.1 and 2.2)

Repository of URLs
- Compressed
- Prefix and suffix search within URLs
- Search by hostname, path, file extension, …

select count(*) from … where url LIKE 'http://%.it/%.asp'

Up to two orders of magnitude faster than a sequential scan or a B-tree

Space occupancy << B-tree
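Prefix and suffix queries over a URL collection can be illustrated with two sorted arrays and binary search. This Python sketch is a plain, uncompressed stand-in and makes no claim about the data structure actually used in the project; the class name `UrlRepository` and the sample URLs are hypothetical.

```python
import bisect

class UrlRepository:
    """Sorted-array index supporting prefix and suffix lookups on URLs."""

    def __init__(self, urls):
        self.forward = sorted(set(urls))
        # Reversed strings turn a suffix query into a prefix query.
        self.backward = sorted(url[::-1] for url in self.forward)

    @staticmethod
    def _prefix_range(sorted_strings, prefix):
        lo = bisect.bisect_left(sorted_strings, prefix)
        # U+10FFFF is the largest code point, so this upper bound sorts
        # after every string that starts with the prefix.
        hi = bisect.bisect_right(sorted_strings, prefix + "\U0010ffff")
        return sorted_strings[lo:hi]

    def with_prefix(self, prefix):
        return self._prefix_range(self.forward, prefix)

    def with_suffix(self, suffix):
        matches = self._prefix_range(self.backward, suffix[::-1])
        return [reversed_url[::-1] for reversed_url in matches]

repo = UrlRepository([
    "http://www.unipi.it/home.asp",
    "http://www.cnr.it/index.html",
    "http://example.com/page.asp",
    "https://www.unipi.it/a.asp",
])
print(repo.with_prefix("http://www."))
print(sorted(repo.with_suffix(".asp")))
```

A pattern with both a fixed prefix and a fixed suffix can be answered by intersecting the two result sets; patterns with a wildcard in the middle, such as the `LIKE` query above, need an extra filtering step on top of this sketch.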

Page 33: Action 2: Mine the Web

Storage and access methods: index compression (WP 2.3)

Assigning DocIDs in a clever way could improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps (the differences between consecutive DocIDs in a posting list).

Clustering property: within each posting list there are dense zones (i.e. many small DGaps).

Our problem consists of enhancing the clustering property of posting lists.
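The effect of small DGaps on variable-byte encoding can be checked with a short Python sketch. It only measures encoded sizes for two hand-picked posting lists; it does not implement any DocID reassignment algorithm, and the byte layout used (7 payload bits per byte, high bit set on the last byte of each number) is one common convention chosen for illustration.

```python
def d_gaps(doc_ids):
    """Turn a sorted posting list into gaps between consecutive DocIDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()          # most significant group first
        chunk[-1] |= 0x80        # flag the last byte of this number
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(n)
            n = 0
    return numbers

scattered = [3, 400, 20000, 70000, 150000]   # same term, DocIDs far apart
clustered = [1, 2, 3, 4, 5]                  # same term, DocIDs adjacent

print(len(vbyte_encode(d_gaps(scattered))))  # 12 bytes
print(len(vbyte_encode(d_gaps(clustered))))  # 5 bytes
```

Both lists hold five postings, but the clustered one needs one byte per gap, while the scattered one needs two or three bytes for most gaps.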

Page 34: Action 2: Mine the Web

Compression Enhancement

Page 35: Action 2: Mine the Web

Content delivery (WP 2.1, 2.2 and 2.3)

Web Caching
- Mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)

Recommendation system (online/offline)
- Mining of web sessions aimed at profiling users and recommending related pages to them (WP 2.1, 2.3)

Transactional Clustering
- Clustering specialized for transactional data, aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)

Page 36: Action 2: Mine the Web

Content delivery (WP 2.3)

SUGGEST: a recommendation system made up of two distinct modules
- Offline: performs model extraction by means of a clustering algorithm which partitions the Usage Graph
- Online: performs user classification and suggestion generation

The WOS remarkably shortened implementation time (< 500 C++ lines)

We used three WOS objects to produce a persistent clustering structure:
- Citation
- PageView
- SessionsCluster

Page 37: Action 2: Mine the Web

Content delivery (WP 2.2)

Goal: retrieve the pages which match the user's needs.

This is a very difficult task, given that:
- the Web size is increasing, and so is the number of answers
- Web coverage is a problem for a single search engine
- Web pages are heterogeneous
- user needs are subjective and time-varying
- the "list of keywords" paradigm for a user query may be ambiguous

SnakeT: clusters the web snippets returned by many search engines into hierarchically labeled folders, which are created on the fly to capture the various meanings of the answers returned for a user query

Page 38: Action 2: Mine the Web

SnakeT: An example of use

Page 39: Action 2: Mine the Web

SnakeT: An example of use

Look at the DEMO

Page 40: Action 2: Mine the Web

Content delivery (WP 2.1)

Clustering of
- E-mails (manco)
- XML documents (chiara)
??

Page 41: Action 2: Mine the Web

Ongoing and future activities

Work in progress
- Pursuing our goal of exploiting USAGE, STRUCTURE, CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web
- Implementation of additional WOS layers: compression booster, XML clustering

Future work (medium-long term)
- WOS, final version
- Community-oriented ranking
- Content (news, XML, ...) clustering
- Cooperation with Nutch.org (Doug Cutting in Pisa next October)
- etc.

Page 42: Action 2: Mine the Web

Deployment scenarios

Concerning the role of the WOS and of the ECD applications, three (non-exclusive) deployment scenarios can be envisaged:
- The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
- The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to IBM's WebFountain project
- The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase

Page 43: Action 2: Mine the Web

Demo Session

Three demos here:
- WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
- SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
- ANTIX: Participatory Search System (Andrea Esuli)

Some other activities are described in the posters

Page 44: Action 2: Mine the Web

More information

Interested people can find these slides, more information, documents and the full list of publications at the address:

http://ecd.isti.cnr.it