

The Information School of the University of Washington

Web Mining

Meliha Yetisgen-Yildiz


Outline

• Definition of Web Mining
• Web Mining Taxonomy
• Examples
  – Mining Topic-Specific Concepts and Definitions


Web Mining

• The Web is a huge collection of
  – Documents
  – Hyperlink information
  – Access and usage information
• Mining the enormous wealth of information on the Web
  – Financial information (e.g. stock quotes)
  – Book stores (e.g. Amazon)

WWW Facts

• Unstructured: No standards and heterogeneous

• Dynamic: Growing and changing very rapidly

• Size: Too huge for effective data warehousing and mining


Related Fields

• Natural Language Processing
• Information Retrieval
• Machine Learning
• Statistics
• Information Visualization


Web Mining Taxonomy

• Web Mining
  – Web Content Mining
    • Web Page Content Mining
    • Search Result Mining
  – Web Structure Mining
  – Web Usage Mining
    • General Access Pattern Tracking
    • Customized Usage Tracking


Web Content Mining

• Web Page Content Mining: Web Page Summarization
  – WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998), …: Web structuring query languages; can identify information within given web pages
  – Ahoy! (Etzioni et al., 1997): Uses heuristics to distinguish personal home pages from other web pages
  – ShopBot (Etzioni et al., 1997): Looks for product prices within web pages


Web Content Mining

• Search Result Mining: Search Engine Result Summarization
  – Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): Categorizes documents using phrases in titles and snippets
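To make the idea of grouping results by phrases in titles and snippets concrete, here is a minimal Python sketch. It is not the algorithm from the cited papers: it simply indexes word bigrams and keeps the phrases shared by at least two results. The function names, the sample results, and the choice of bigrams are assumptions made for illustration.

```python
import re
from collections import defaultdict


def phrases(text, n=2):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def cluster_by_shared_phrases(results, n=2, min_docs=2):
    """Map each phrase shared by at least min_docs results to their indices.

    results is a list of (title, snippet) pairs; this is a hypothetical
    input format, not one defined in the slides.
    """
    index = defaultdict(set)
    for doc_id, (title, snippet) in enumerate(results):
        for phrase in phrases(title + " " + snippet, n):
            index[phrase].add(doc_id)
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_docs}


# Made-up search results used only to exercise the sketch.
results = [
    ("Web mining overview", "An introduction to web mining techniques"),
    ("Data warehousing", "Building a data warehouse for sales reports"),
    ("Web mining tools", "Open source tools for mining web logs"),
]
clusters = cluster_by_shared_phrases(results)
print(clusters)
```

Each key in `clusters` is a candidate cluster label, and its value lists the results that contain that phrase. A fuller system would also merge overlapping clusters and rank them.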


Web Structure Mining

• Using Links
  – PageRank (Brin et al., 1998)
  – CLEVER (Chakrabarti et al., 1998)
  – Use interconnections between web pages to give weight to pages.
• Using Generalization
  – MLDB (1994), VWV (1998)
  – Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
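As an illustration of using interconnections between pages to give weight to pages, the following Python sketch runs a PageRank-style power iteration on a tiny link graph. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions chosen for the example; the slides do not specify them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of page -> score for a graph given as page -> list of targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped score evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this graph, page C receives links from three pages and ends up with the highest score, while D, which nothing links to, ends up with the lowest.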


Web Usage Mining

• General Access Pattern Tracking
  – Web Log Mining (Zaïane, Xin and Han, 1998): Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.
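A very simple form of general access pattern tracking is counting successful requests per page across all visitors. The Python sketch below assumes the server writes Common Log Format lines; the regular expression, the sample log lines, and the `summarize` helper are illustrative assumptions, and the counting shown is far simpler than the KDD techniques the cited work applies.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+$'
)


def summarize(lines):
    """Return (hits per page, distinct visitor hosts per page) for status-200 requests."""
    hits_per_page = Counter()
    hosts_per_page = {}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        host, _timestamp, _method, path, status = match.groups()
        if status != "200":
            continue  # ignore errors and redirects
        hits_per_page[path] += 1
        hosts_per_page.setdefault(path, set()).add(host)
    visitors_per_page = {path: len(hosts) for path, hosts in hosts_per_page.items()}
    return hits_per_page, visitors_per_page


# Made-up log lines used only to exercise the sketch.
log_lines = [
    '10.0.0.1 - - [27/Jul/2020:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [27/Jul/2020:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [27/Jul/2020:10:02:00 +0000] "GET /courses.html HTTP/1.0" 200 2210',
    '10.0.0.1 - - [27/Jul/2020:10:03:00 +0000] "GET /missing.html HTTP/1.0" 404 512',
]
hits, visitors = summarize(log_lines)
print(hits.most_common())
print(visitors)
```

Aggregates like these are the raw material for spotting trends, such as pages that are popular but buried deep in the site structure.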


Web Usage Mining

• Customized Usage Tracking
  – Adaptive Sites (Perkowitz and Etzioni, 1997): Analyzes the access patterns of one user at a time. The web site restructures itself automatically by learning from user access patterns.
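One way to picture a site that restructures itself from a single user's access patterns is a navigation menu reordered by that user's visit counts. The Python sketch below is a hypothetical illustration only, not the method of the cited paper; `personalized_menu`, the link names, and the visit history are all invented for the example.

```python
from collections import Counter


def personalized_menu(default_links, user_visits, top_k=3):
    """Return the top_k menu links for one user, most visited first.

    Links the user never visited keep their default relative order,
    because Python's sort is stable.
    """
    counts = Counter(v for v in user_visits if v in default_links)
    ranked = sorted(default_links, key=lambda link: -counts[link])
    return ranked[:top_k]


# Hypothetical site menu and one user's visit history.
default_links = ["/home", "/news", "/courses", "/people", "/contact"]
user_visits = ["/courses", "/people", "/courses", "/home", "/courses", "/people"]
menu = personalized_menu(default_links, user_visits)
print(menu)
```

Keeping the default order for unvisited links means a new user with no history simply sees the original menu.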


Web Content Mining Example

B. Liu, C. W. Chin, and H. T. Ng. (2003) “Mining Topic-Specific Concepts and Definitions on the Web”. WWW 2003, Budapest, Hungary.


Mining Topic Specific Concepts and Definitions

• Goal:
  – “To help people learn in-depth knowledge of a topic systematically on the Web”
• Main Assumption:
  – The typical path of a person who wants to learn more on a new topic
    • First: Definitions and/or descriptions of the topic
    • Second: Sub-topics and/or salient concepts of the topic


System Architecture


Evaluation

• 28 search topics from Computer Science
• Precision comparison based on
  – Top 10 results returned by WebLearn, Google, AskJeeves
  – Related pages = Pages with definitions
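The precision measure used in this comparison can be sketched in a few lines of Python: the share of returned pages that are related, here expressed as a percentage. The page identifiers and the set of related pages below are invented for illustration and are not data from the evaluation.

```python
def precision(returned_pages, related_pages):
    """Return the percentage of returned pages that are in related_pages."""
    if not returned_pages:
        return 0.0
    hits = sum(1 for page in returned_pages if page in related_pages)
    return 100.0 * hits / len(returned_pages)


# Hypothetical top-10 result list; 7 of the 10 pages are assumed to contain definitions.
returned = ["p%d" % i for i in range(1, 11)]
related = {"p1", "p2", "p3", "p5", "p6", "p8", "p9"}
score = precision(returned, related)
print(score)
```

The denominator is the number of pages actually returned, so a system that returns fewer than ten pages is scored on what it returned.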

Results

Search Topic                  WebLearn   Google   AskJeeves
1. Artificial Intelligence       50.00     0.00        0.00
2. Data Mining                   70.00    30.00       10.00
3. Web Mining                    75.00    37.50       50.00
4. Machine Learning              77.78    22.22       11.11
5. Computer Vision               33.33     0.00        0.00
6. Relational Calculus           83.33    33.33       50.00
7. Linear Algebra                40.00     0.00        0.00
8. Neural Network                80.00    30.00       20.00
9. Fuzzy logic                   90.00    20.00       40.00
10. Time Series                  50.00     0.00        0.00


Salient Concepts for Information Retrieval

1. Digital Libraries
2. Modern Information Retrieval
3. Indexing
4. Images
5. Relevance Feedback
6. Internet
7. Modeling
8. Search Engines
9. Information Processing
10. Machine Learning


References

• D. Backman and J. Rubbin. Web log analysis: Finding a recipe for success. In http://techweb.comp.com/nc/811/811cn2.html, 1997.
• O. Etzioni. The world-wide web: Quagmire or gold mine? Communications of ACM, 39:65-68, 1996.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• C. Faloutsos. Access methods for text. ACM Comput. Surv., 17:49-74, 1985.
• R. Feldman and I. Dagan. Knowledge discovery in textual databases (KDT). In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, Montreal, Canada, Aug. 1995.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB'96, 122-133, Bombay, India, Sept. 1996.


References

• J. Graham-Cumming. Hits and miss-es: A year watching the web. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
• M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
• J. Pitkow. In search of reliable usage data on the www. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
• T. Stabin and C. E. Glasson. First impression: 7 commercial log processing tools slice & dice logs your way. In http://www.netscapeworld.com/netscapeworld/nw-08-1997/nw-08-loganalysis.html, 1997.
• T. Sullivan. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.
• L. Tauscher and S. Greenberg. How people revisit web pages: Empirical findings and implications for the design of history systems. International Journal of Human Computer Studies, Special issue on World Wide Web Usability, 47:97-138, 1997.

References

• G. Salton, J. Allan, C. Buckley, and A. Singhal. Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264:1421-1426, 1994.
• O. R. Zaïane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.


Questions?