Exploring a Hybrid of Support Vector Machines (SVMs) and a Heuristic Based System in Classifying Web Pages

Santa Clara, California, USA

Ahmad Rahman, Yuliya Tarnikova and Hassan Alam

Why Classify Web Pages?

Web Page Classification is often a Pre-processing Stage in a Number of Applications

• Web Search
• Web Page Summarization
• Display of Web Pages in Small Screen Devices
• Archiving Web Pages
• Format Conversion from HTML to other formats

Why Classify Web Pages?

• Specific Algorithm
• Different way to apply Specific parameters
• Local Optimizations

[Diagram: Web Pages → Classify Web Pages → Apply Specific Algorithm]

What Makes Web Pages Different from Each Other?

• Type of Content
  – Banking and Finance
  – Programming Language
  – Science
  – Sport
  – Others?

• Manifestation
  – Linguistic Difference

M. Sinka and D. Corne. A large benchmark dataset for web document clustering. Int. Conf. on Hybrid Intelligent Systems, 2002.

Sports Page

Programming Page

Banking/Finance Page

What is this?

How Do We Use Web Classes?

• Do people writing a web page on banking/finance do it differently than people writing a sports page?

• We know there will be linguistic differences, but will there be structural differences as well?

• If there are differences, how do we characterize them?

Alternate Definitions?

• Intent of the Web Page
  – What is the Main Message?
    • Convey Information?
    • Help in Locating Information?
    • Allow Specific Requests to be processed?
  – Manifestation
    • Text/Link Mapping
    • Specific Task Oriented tagset

Example 1: Informative Web Page (Primarily Textual Content)

Example 2: Locating Information (Primarily Links)

Example 3: Facilitator (Large Chunks of Forms)

Non-Linguistic Features: Structural and Hierarchical Information

• Number of large-story-type columns
• Largest number of forms in one column
• Text size
• Number of links
• Number of images
• Number of columns with forms
• … and others.
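A few of the structural features listed above can be counted straight from the HTML. The sketch below is only an illustration using Python's standard `html.parser`; the class name `StructuralFeatureCounter`, the helper `extract_features`, and the choice to measure text size in characters are assumptions, not the authors' feature extractor. The column-based features need layout analysis and are not attempted here.

```python
from html.parser import HTMLParser


class StructuralFeatureCounter(HTMLParser):
    """Counts links, images, forms and visible text size (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.forms = 0
        self.text_size = 0      # assumed unit: characters of visible text
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag == "img":
            self.images += 1
        elif tag == "form":
            self.forms += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies; count only stripped visible text.
        if self._skip_depth == 0:
            self.text_size += len(data.strip())


def extract_features(html: str) -> dict:
    """Return a small feature dictionary for one page."""
    counter = StructuralFeatureCounter()
    counter.feed(html)
    counter.close()
    return {
        "links": counter.links,
        "images": counter.images,
        "forms": counter.forms,
        "text_size": counter.text_size,
    }
```

A feature vector for the SVM could then be built by putting these counts in a fixed order.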

Support Vector Machine

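• Structural Risk Minimization
  – Vapnik-Chervonenkis (VC) Dimension
    - Property of a set of functions $\{f(\alpha)\}$
    - Maximum number of training points that can be shattered by $\{f(\alpha)\}$
    - Ex: the VC dimension of the set of oriented lines (hyperplanes) in $R^n$ is $h = n + 1$
  – VC Theory provides bounds on the test error, which depend on both the empirical risk and the capacity of the function class:

$$R(\alpha) \le R_{emp}(\alpha) + \Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right)$$

$$\Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}}$$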
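To get a feel for how the capacity term in the VC bound behaves, the sketch below evaluates the standard form of the confidence term, sqrt((h(log(2l/h) + 1) − log(η/4)) / l). It assumes the usual reading of the symbols from Vapnik's theory: h is the VC dimension, l is the number of training points, and the bound holds with probability 1 − η. The natural logarithm and the requirement l ≥ h are assumptions of this sketch, and the function names are hypothetical.

```python
import math


def vc_confidence(h: int, l: int, eta: float) -> float:
    """Capacity term of the VC bound (natural log assumed)."""
    if h <= 0 or l < h:
        raise ValueError("this sketch expects 0 < h <= l")
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


def risk_bound(empirical_risk: float, h: int, l: int, eta: float) -> float:
    """Upper bound on the expected risk: empirical risk plus capacity term."""
    return empirical_risk + vc_confidence(h, l, eta)
```

The term grows with h and shrinks with l, which is the trade-off structural risk minimization exploits: a richer function class lowers the empirical risk but loosens the bound.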

Hyperplane Classification

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$$

$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$$
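The two hyperplane constraints (w · x_i + b ≥ +1 for y_i = +1 and w · x_i + b ≤ −1 for y_i = −1) can be folded into the single condition y_i (w · x_i + b) ≥ 1. The sketch below checks that condition for a given hyperplane; the function names and the toy data in the usage note are hypothetical, and nothing here trains an SVM.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w . x + b of a point under the hyperplane (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_margin(w, b, points, labels) -> bool:
    """True if every labelled point obeys y_i * (w . x_i + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in zip(points, labels))


def classify(w, b, x) -> int:
    """Predict +1 or -1 from the side of the hyperplane the point falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1
```

For example, with w = (1, 0) and b = 0, the points (2, 0) and (−2, 0) labelled +1 and −1 satisfy the margin, while a point (0.5, 0) labelled +1 is classified correctly but falls inside the margin.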

SVM Implementation

We have adopted an implementation of SVMlight, which is an implementation of Vapnik's Support Vector Machine [1] for the problem of pattern recognition. The optimization algorithm used in SVMlight is described in [2].

[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[2] T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.). MIT Press, 1999.

Initial Experiment

Database: 200 Randomly Selected Web Pages
Training Database: 100
Test Database: 100

Classes:
1. Story Pages
2. Reference Pages
3. Form Pages

SVM: Dot Product, Pair-wise

SVM Performance:
On Training Data: 95%
On Test Data: 87%
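A pair-wise scheme with a dot-product kernel trains one binary classifier per pair of classes and lets them vote. The sketch below shows only the voting step, assuming each pair already has a linear decision function (w, b); the `pair_models` layout, the function names, and any weights passed in are hypothetical and are not the trained SVMlight models. A tie is reported as `None` rather than resolved.

```python
from collections import Counter
from itertools import combinations


def dot(u, v):
    """Plain dot product, i.e. a linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_predict(x, classes, pair_models):
    """Vote over all class pairs.

    pair_models maps (class_a, class_b) -> (w, b); a non-negative score
    w . x + b is a vote for class_a, a negative one for class_b.
    Returns the winning class, or None if the top vote count is tied.
    """
    votes = Counter()
    for first, second in combinations(classes, 2):
        w, bias = pair_models[(first, second)]
        winner = first if dot(w, x) + bias >= 0 else second
        votes[winner] += 1
    ranked = votes.most_common()
    top_count = ranked[0][1]
    tied = [label for label, count in ranked if count == top_count]
    return tied[0] if len(tied) == 1 else None
```

With three classes there are three pair-wise classifiers, and a cyclic outcome (each class winning exactly once) produces a tie.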

Hybridization

[Diagram: All Web Pages → Heuristic-Based Method → Forms / Non-Forms; Non-Forms → SVM → References / Stories]
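The two-stage flow (a heuristic first separates form pages, then an SVM splits the remaining pages into references and stories) can be sketched as a small dispatcher. Both stages are passed in as callables; `is_form` and `story_or_reference` are hypothetical stand-ins for the heuristic and the trained SVM, not the authors' code.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]


def classify_page(
    features: Features,
    is_form: Callable[[Features], bool],
    story_or_reference: Callable[[Features], str],
) -> str:
    """Stage 1: heuristic form detection. Stage 2: SVM on non-form pages."""
    if is_form(features):
        return "form"
    return story_or_reference(features)
```

Because the second stage never sees form pages, it only has to solve a two-class problem.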

Form Separation Heuristics

Defining the Form Probability Score (FPS) as F = ∑ (over all forms i) f(i) * w(i), where:

• Individual form score f(i) = #(submits & resets) * 0.2 + #(radio buttons and check boxes) * 0.5 + #(all other active fields);

• The "Weight" w(i) for the form is defined as follows:
  w(i) = f(i), if f(i) ∈ [0, 2],
  w(i) = 2 + (f(i) – 2)/2, if f(i) ∈ [2, 4],
  w(i) = 3 + (f(i) – 4)/4, if f(i) ∈ [4, 6],
  w(i) = 3.5, if f(i) > 6.

• Based on these two parameters, a web page is a form if: the size of the text preceding the first form is less than 300, and F / (#links) > 0.25 and F / (#text) > 0.01.
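The form heuristic is easy to sketch in code. The version below follows the stated formulas, reading the third weight interval as [4, 6], which is what makes w continuous at f = 4 and f = 6. The `FormCounts` class, the function names, and the handling of zero denominators (a zero link or text count is treated as passing that ratio test) are assumptions of this sketch; the slide does not say how those cases were handled or in what unit text size is measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FormCounts:
    """Field counts for one HTML form (hypothetical container)."""
    submits_and_resets: int
    radios_and_checkboxes: int
    other_active_fields: int


def form_score(form: FormCounts) -> float:
    """Individual form score f(i)."""
    return (0.2 * form.submits_and_resets
            + 0.5 * form.radios_and_checkboxes
            + form.other_active_fields)


def form_weight(f: float) -> float:
    """Piecewise-linear weight w(i), saturating at 3.5."""
    if f <= 2:
        return f
    if f <= 4:
        return 2 + (f - 2) / 2
    if f <= 6:
        return 3 + (f - 4) / 4
    return 3.5


def form_probability_score(forms: Sequence[FormCounts]) -> float:
    """F = sum over all forms of f(i) * w(i)."""
    return sum(form_score(x) * form_weight(form_score(x)) for x in forms)


def is_form_page(forms, text_before_first_form, n_links, text_size) -> bool:
    """Apply the three thresholds: 300, 0.25 and 0.01."""
    score = form_probability_score(forms)
    if score <= 0 or text_before_first_form >= 300:
        return False
    # Assumption: a zero denominator counts as passing that ratio test.
    links_ok = n_links == 0 or score / n_links > 0.25
    text_ok = text_size == 0 or score / text_size > 0.01
    return links_ok and text_ok
```

For a single form with 1 submit button, 2 radio buttons and 3 other fields, f = 4.2, w = 3.05 and F is about 12.81, so a page with 20 links and a text size of 500 would pass both ratio tests.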

New Experiment (1)

Training Set: First 100
Test Set: Last 100

First Stage (Heuristics): 100% on Train and Test Data

Second Stage, On Training Data: 97% Correct

            Story   Reference
Story          42           1
Reference       1          46

Second Stage, On Test Data: 90% Correct

            Story   Reference
Story          41           1
Reference       4          37

Combined:
Training: 98%
Test: 95%
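The combined figures can be reproduced from the Story/Reference confusion matrices with a little arithmetic. The sketch below assumes that every page missing from a second-stage matrix is a form page handled by the first stage (which the slide reports as 100% correct), and that each set holds 100 pages. The function names are hypothetical.

```python
def accuracy(matrix):
    """Fraction of pages on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def combined_accuracy(matrix, n_pages, first_stage_accuracy=1.0):
    """Accuracy over all pages, counting first-stage form pages too.

    Assumption: pages absent from the matrix were routed out as forms.
    """
    second_stage_pages = sum(sum(row) for row in matrix)
    form_pages = n_pages - second_stage_pages
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (correct + first_stage_accuracy * form_pages) / n_pages


train = [[42, 1], [1, 46]]   # New Experiment (1), training data
test = [[41, 1], [4, 37]]    # New Experiment (1), test data
print(round(accuracy(train), 3), combined_accuracy(train, 100))
print(round(accuracy(test), 3), combined_accuracy(test, 100))
```

Under these assumptions the training matrix gives 88/90 (about 97.8%) and a combined 98/100, matching the slide. The test matrix gives 78/83 (about 94%), which reproduces the combined 95% but not the 90% quoted for the second stage alone.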

New Experiment (2)

Training Set: Last 100
Test Set: First 100

First Stage (Heuristics): 100% on Train and Test Data

Second Stage, On Training Data: 98% Correct

            Story   Reference
Story          40           1
Reference       0          42

Second Stage, On Test Data: 90% Correct

            Story   Reference
Story          39           4
Reference       5          42

Combined:
Training: 99%
Test: 91%

Average Accuracy

Hybrid
On Training Data: 98.5%
On Test Data: 93%

Pair-wise SVM
On Training Data: 95%
On Test Data: 87%
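The hybrid averages appear to be the plain mean of the combined results from the two experiments (98% and 99% on training data, 95% and 91% on test data). As a worked check:

```latex
\[
\text{Hybrid (training)} = \frac{98\% + 99\%}{2} = 98.5\%,
\qquad
\text{Hybrid (test)} = \frac{95\% + 91\%}{2} = 93\%
\]
```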

Future Work?

• We want to correlate different types of pages (structure) with respect to linguistic differences

• We want to characterize the structural features we used with respect to purely linguistic features

• We want to quantify the improvement in a secondary process due to the success or failure of the web classification process

Conclusion

• SVM is a very effective solution for web page classification
• Often the pre-defined number of web classes is small
• Heuristics, if correctly applied, can be very useful in boosting the SVM ensemble
• For a problem of more than three classes, heuristics can be applied in sequence
• For problems of more than three classes, solving ties of the pair-wise classifiers becomes a major problem; this is addressed in a later paper (MCS2003)
• Current applications of this include web page summarization and re-authoring