![Page 1: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/1.jpg)
Genre and Task for Web Page Filtering
Michael Shepherd
Web Information Filtering Lab
Faculty of Computer Science
Dalhousie University
![Page 2: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/2.jpg)
Research Team
• Students
  – Lei Dong
  – Alistair Kennedy
  – Richong Zhang
• Faculty
  – Carolyn Watters
  – Jack Duffy
![Page 3: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/3.jpg)
Overview
• Introduction
• Genre
• Task
• Summary
![Page 4: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/4.jpg)
Introduction
• The focus of our current research is the investigation of filtering techniques for the Web
• This includes context-aware retrieval, where context includes:
  – Adaptive user modeling
  – The user’s “task”
    • Information need
    • What it is the user is trying to do
• We are moving to incorporate the notions of genre and task and to evaluate the impact that these have on filtering
![Page 5: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/5.jpg)
Filtering
Genre
Task
User Profiles
![Page 6: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/6.jpg)
Motivation for Research
• The Web has billions of documents
• Average query is 2-3 words
One document will satisfy our information need!
![Page 7: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/7.jpg)
But it’s more than just search
• “Browsing or surfing the Web represents the main model for web use, especially among younger users.” (Hunter)
![Page 8: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/8.jpg)
Browsing
• Three general types [Marchionini]
  – Directed browsing: explicit information need
  – Semi-directed browsing: less well-defined need
  – Undirected browsing: there is no real goal and the user is “surfing”
Continuum: from Surfing to Searching
![Page 9: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/9.jpg)
Motivated Behaviour
• Intrinsically Motivated Behaviour
  – “… is that which appears to be spontaneously initiated by the person in pursuit of no other goal than the activity itself.” [Enzle, Wright, Redondo]
  – “… engaging in a task for its enjoyment value…” [Deci, Ryan]
• Extrinsically Motivated Behaviour
  – “… motivation is to engage in an activity as a means to an end … participation will result in desirable outcomes such as reward …” [Pintrich, Schunk]
![Page 10: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/10.jpg)
Task and Information Need
Continuum: from general information gathering to explicit information need
• General information gathering: “I’m shopping for a computer”
• Explicit information need: “I want the price on the Dell Inspiron Notebook computer”
So, one document may not satisfy the information need
![Page 11: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/11.jpg)
Search Engine Results
[Figure: Number of Relevant Hits (y-axis, 0 to 9) by Ranked Screens (x-axis, screens 1 to 11), 10 hits per screen]
![Page 12: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/12.jpg)
Optimal Results, After Filtering
[Figure: Number of Relevant Hits (y-axis, 0 to 12) by Ranked Screens (x-axis, screens 1 to 10), 10 hits per screen]
![Page 13: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/13.jpg)
Why look at Genre and Task?
![Page 14: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/14.jpg)
Filtering Based on Adaptive User Profiles and IR-type of Task
• Intrinsic Motivation
  – Fine-grained filtering of the Web is not feasible when the browsing task is “undirected”
• Extrinsic Motivation
  – Fine-grained filtering of the Web is feasible when there is an explicit information need
![Page 15: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/15.jpg)
Genre
• A genre is a “classifying statement”
• It allows us to recognize items that are similar even in the midst of great diversity
  – Newspapers
  – Mystery novels
  – Office memos
• Socially recognized communicative purpose
• Generally characterized by the tuple: <content, form>
![Page 16: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/16.jpg)
![Page 17: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/17.jpg)
Cybergenre
• Genre on the web
• Characterized by the tuple
<content, form, functionality>
• Where functionality is the functionality afforded by the new medium, i.e., the web
![Page 18: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/18.jpg)
cybergenre
• extant
  – replicated (e.g., electronic newspaper)
  – variant (e.g., multimedia newspaper)
• novel
  – emergent (e.g., personalized newspaper)
  – spontaneous (e.g., FAQ)
![Page 19: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/19.jpg)
![Page 20: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/20.jpg)
Recognizing Genres of Web Pages
• The number of cybergenres is increasing, with different estimates putting the number at well over 1000 (depends on granularity)
• It is difficult to know the boundaries of a genre and to know when one has crossed from one genre into another genre
• It is difficult to know when a web page represents the emergence of a new genre
![Page 21: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/21.jpg)
Research Problems
• How can we identify automatically the genre of a web page?
• What features should be used in describing web pages?
• How can we make this adaptive to recognize:
  – New genres when they emerge?
  – Genre classes that are fuzzy and genres that slide from one class to another?
![Page 22: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/22.jpg)
Research Questions
• Can we identify home pages?
• Can we distinguish among the sub-genres:
  – personal, corporate and organization home pages?
• What influence does the functionality attribute have in distinguishing these genres and sub-genres?
![Page 23: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/23.jpg)
![Page 24: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/24.jpg)
![Page 25: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/25.jpg)
![Page 26: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/26.jpg)
Machine Learning Model and Dataset
• The dataset consisted of 321 web pages
  – 17 were classified manually as belonging to two of the three home page sub-genres
  – 94 corporate home pages
  – 93 personal home pages
  – 74 organization home pages
  – 77 noise pages
• Neural Net Model
  – Single classifier with three target output classes
  – Three different classifiers, one for each of three target output classes
![Page 27: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/27.jpg)
Features
Content
  – Number of meta tags used
  – Does the page contain any phone numbers?
  – List of most common words appearing in between 16% and 40% of all documents
Form
  – Number of images
  – Does the page have its own domain, or is it in a sub-directory within a domain?
  – Size of file in bytes
  – Number of words in the page
Functionality
  – Number of links in the web page
  – Number of e-mail links
  – Proportion of links that are navigational links to other web pages within the same site
  – Proportion of links that are links to locations within the same page
  – Proportion of links that are links to other pages on other sites
  – Number of form inputs
  – Is the first tag a script tag?
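As a rough illustration of how some of the functionality features listed above could be computed, here is a minimal sketch using only the Python standard library. The class name `LinkFeatureParser`, the function `functionality_features`, the feature keys, and the rule that a link counts as "same site" when its host matches the page's host are all assumptions made for this example; the slides do not say how the features were actually extracted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkFeatureParser(HTMLParser):
    """Counts link types and form inputs in one page (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.site = urlparse(page_url).netloc
        self.counts = {"links": 0, "email_links": 0, "same_site": 0,
                       "same_page": 0, "other_site": 0, "form_inputs": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("input", "select", "textarea"):
            self.counts["form_inputs"] += 1
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            self.counts["links"] += 1
            if href.lower().startswith("mailto:"):
                self.counts["email_links"] += 1
            elif href.startswith("#"):
                self.counts["same_page"] += 1
            elif urlparse(urljoin(self.page_url, href)).netloc == self.site:
                # Assumption: same host name means "same site".
                self.counts["same_site"] += 1
            else:
                self.counts["other_site"] += 1


def functionality_features(html, page_url):
    """Return counts and link proportions for a single page."""
    parser = LinkFeatureParser(page_url)
    parser.feed(html)
    parser.close()
    c = parser.counts
    total = c["links"] or 1  # avoid division by zero on pages without links
    return {
        "num_links": c["links"],
        "num_email_links": c["email_links"],
        "prop_same_site": c["same_site"] / total,
        "prop_same_page": c["same_page"] / total,
        "prop_other_site": c["other_site"] / total,
        "num_form_inputs": c["form_inputs"],
    }
```

In this sketch the proportions are taken over all links, including e-mail links; a different denominator would be an equally reasonable reading of the slide.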
![Page 28: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/28.jpg)
Terms Selected as Features
| Class | Terms |
|---|---|
| Personal Home Page | my, me, i, t |
| Corporate Home Page | we, services, service, available, fax, our, us, com, contact, copyright, free, amp |
| Organization Home Page | events, community, organization, 2004, help, its, members, news, information |
![Page 29: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/29.jpg)
Neural Net Categorization
[Diagram: data set of web pages of known genre type → input feature vector → neural net → target categories: Personal Home Page, Organization Home Page, Corporate Home Page]
![Page 30: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/30.jpg)
Evaluation
• Recall
  – The proportion of web pages of genre type Gi that are correctly categorized into category Ci
• Precision
  – The proportion of web pages categorized into category Ci that are of genre type Gi

$$
F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
$$

F-measure(Gi) = the quality of the classifier with respect to web pages of genre type Gi
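The three measures above can be sketched in a few lines of Python. The function name `precision_recall_f` and the count-based interface (true positives, false positives, false negatives for one genre Gi) are assumptions chosen for illustration, and the example counts in the usage note are hypothetical.

```python
def precision_recall_f(true_positive, false_positive, false_negative):
    """Precision, recall and F-measure for one genre type.

    true_positive:  pages of genre Gi placed in category Ci
    false_positive: pages placed in Ci that are not of genre Gi
    false_negative: pages of genre Gi placed in some other category
    """
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For example, with hypothetical counts of 8 true positives, 2 false positives and 4 false negatives, precision is 8/10, recall is 8/12, and the F-measure is their harmonic mean, 8/11.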
![Page 31: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/31.jpg)
10-Fold Cross Validation
• Used when data set is small in order to obtain statistically valid results
![Page 32: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/32.jpg)
[Figure: the data set divided into ten partitions of 10% each]
![Page 33: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/33.jpg)
[Figure: in each round, one 10% partition is the test set (Test Set 1, Test Set 2, Test Set 3, and so on) and the remaining 90% is the training set]
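The partitioning used in 10-fold cross validation can be sketched as follows. This is a generic illustration, not the lab's code: the function name `k_fold_splits`, the fixed random seed, and the round-robin assignment of shuffled items to folds are all assumptions.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (training_set, test_set) pairs, one per fold."""
    items = list(items)
    random.Random(seed).shuffle(items)  # shuffle once, reproducibly
    folds = [items[i::k] for i in range(k)]  # k roughly equal partitions
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each item appears in exactly one test set, so every page is used for testing once and for training in the other nine rounds; the reported scores are then averaged over the ten rounds.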
![Page 34: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/34.jpg)
F-measures using separate classifiers with noise pages

| Genre | <content, form, functionality> | <content, form> | Significant Difference |
|---|---|---|---|
| Personal Home Page | .711 | .702 | - |
| Corporate Home Page | .666 | .637 | .005 |
| Organization Home Page | .553 | .555 | - |
![Page 35: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/35.jpg)
F-measures using single classifier with noise pages

| Genre | <content, form, functionality> | <content, form> | Significant Difference |
|---|---|---|---|
| Personal Home Page | .712 | .698 | .05 |
| Corporate Home Page | .650 | .644 | - |
| Organization Home Page | .537 | .536 | - |
![Page 36: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/36.jpg)
Misclassification Tables: Single Classifier
<content, form, functionality>

| | P | C | O | Non-home |
|---|---|---|---|---|
| Personal | 62.2 | 3.1 | 8.2 | 22.2 |
| Corporate | 3.7 | 56.5 | 14.8 | 25.4 |
| Organization | 4.8 | 12.2 | 36.5 | 25.9 |
| Noise Pages | 11.1 | 7.4 | 6.7 | 52.9 |
![Page 37: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/37.jpg)
Genre Summary
• We can recognize home pages from noise pages
• We can distinguish personal home pages from corporate and organization home pages, but distinguishing between corporate and organizational home pages is difficult
• Feature set needs a lot more attention paid to it
![Page 38: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/38.jpg)
Open Questions
• What is an appropriate feature set?
• Full evaluation of functionality attribute
• What ML model to use?
  – Accuracy and scalability
• Adaptive
  – Track recognized genres as they evolve
  – Recognize the introduction of a novel genre not seen previously
  – Is this like topic detection and tracking?
![Page 39: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/39.jpg)
Genre and Task on the Web?
| Group | Genre | Task | Recognition |
|---|---|---|---|
| Topics | Home page, location, special topics | Cultural, shopping, news, health | URL is only a host name, short, lots of graphics |
| Publications | Articles, publications, news | Scholarly research, news, financial | Hierarchical structure, longer, few graphics |
| Products | Product info, reviews, order forms | Shopping, news, computing | Short, prices, phone numbers |
| Educational | Glossary, course list, instructional material | Educational pursuits | edu domain, education lexicon |
| FAQ | FAQ | Health, self-help | Metadata and headings, structure |

Roussinov, et al., Genre Based Navigation on the Web, HICSS’34
![Page 40: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/40.jpg)
Yahoo Directory
![Page 41: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/41.jpg)
Yahoo Directory
• Yahoo categories are created and maintained manually
  – Creator of a web site submits a description
  – Editors review these
• Can we automatically classify a web page by task?
![Page 42: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/42.jpg)
Experiment
• Creation of data set
• Data cleaning
• 10-fold cross validation
  – Feature selection (IG)
  – Principal component analysis
  – Build Decision Tree
  – Testing
![Page 43: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/43.jpg)
Creation of Data Set
• Selected 120 web pages randomly from Yahoo directories in each of:
  – Shopping
  – Health
  – Education
• Selected 70 pages (NSHE) not from the Web that are not shopping, health or education
• Total of 430 Web pages
• Validated by 3 raters
![Page 44: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/44.jpg)
Data Cleaning
• XML, HTML tags
  – <href>, <img>, <p>
• Pictures, Audio files, Video files
• Scripts
  – <javascript>
• Stop words
• Porter’s stemming algorithm
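A minimal sketch of the cleaning steps listed above, using only the Python standard library. The names `TextExtractor`, `clean_page` and `STOP_WORDS` are hypothetical, and the stop-word list is a tiny placeholder rather than the list used in the experiment. Stemming is deliberately left out of this sketch; in practice a Porter stemmer (for example `nltk.stem.PorterStemmer`) would be applied to each remaining token.

```python
import re
from html.parser import HTMLParser

# Placeholder stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Keeps visible text; drops tags and the contents of scripts and styles."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def clean_page(html):
    """Return lower-cased word tokens with markup and stop words removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

Images, audio and video are dropped implicitly here because only text nodes are kept.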
![Page 45: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/45.jpg)
Feature Selection Using the Information Gain (IG)
• Employed as a term goodness criterion
• Based on Information Theory
  – The number of “bits of information” gained by knowing the term is present or absent
![Page 46: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/46.jpg)
Information Gain (IG)
• A measure of importance of the feature for predicting the presence of the class.
The information gain of term t is defined to be

$$
G(t) = -\sum_{i=1}^{m} P_r(c_i)\log P_r(c_i) + P_r(t)\sum_{i=1}^{m} P_r(c_i \mid t)\log P_r(c_i \mid t) + P_r(\bar{t})\sum_{i=1}^{m} P_r(c_i \mid \bar{t})\log P_r(c_i \mid \bar{t})
$$

where $\{c_i\}_{i=1}^{m}$ denotes the set of categories in the target space.
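The definition above can be computed from document counts alone. The sketch below is an illustration under stated assumptions: the function names `entropy` and `information_gain` are invented for this example, logarithms are taken base 2 (so the result is in bits), and probabilities are estimated as simple document-frequency ratios. The slides do not state which log base or estimation details were used.

```python
import math


def entropy(counts):
    """Entropy (in bits) of the distribution given by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs_with_term, docs_per_class):
    """G(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t).

    docs_with_term: per-category number of documents containing the term
    docs_per_class: per-category total number of documents
    """
    n = sum(docs_per_class)
    with_term = sum(docs_with_term)
    without_term = [n_c - t_c for n_c, t_c in zip(docs_per_class, docs_with_term)]
    p_t = with_term / n
    return (entropy(docs_per_class)
            - p_t * entropy(docs_with_term)
            - (1 - p_t) * entropy(without_term))
```

Writing the formula as a difference of entropies is algebraically the same as the three-sum form: the first sum is the category entropy, and the two weighted sums are the negated conditional entropies given that the term is present or absent. A term that occurs in every document of one category and nowhere else scores high, while a term spread evenly across categories scores zero.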
![Page 47: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/47.jpg)
Information gain (IG)

Number of documents in which term appears in each category

| Term | Health | Shopping | Education | IG value |
|---|---|---|---|---|
| Educ | 23 | 4 | 93 | 0.352725188 |
| Diseas | 49 | 1 | 0 | 0.200911642 |
| Medic | 57 | 5 | 2 | 0.19171664 |
| Health | 79 | 15 | 19 | 0.188112451 |
| Teacher | 0 | 1 | 46 | 0.185452451 |
| School | 7 | 6 | 60 | 0.170352452 |
| Price | 5 | 50 | 1 | 0.16546535 |
| Item | 2 | 52 | 7 | 0.156980483 |
| Ship | 1 | 43 | 2 | 0.149329261 |
| Student | 6 | 3 | 51 | 0.148850138 |
| Custom | 7 | 50 | 4 | 0.133412067 |
| Accessori | 0 | 32 | 0 | 0.130532457 |
| Cancer | 32 | 0 | 0 | 0.130532457 |
| Doctor | 36 | 1 | 1 | 0.124860273 |
| Public | 16 | 3 | 51 | 0.12081971 |
| Shop | 10 | 55 | 9 | 0.120056849 |
| Heart | 33 | 2 | 0 | 0.116157938 |
| Cart | 0 | 35 | 4 | 0.114589777 |
| Medicin | 37 | 2 | 2 | 0.113854763 |
| Physician | 27 | 0 | 0 | 0.10811821 |
| Risk | 26 | 0 | 0 | 0.103738132 |
![Page 48: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/48.jpg)
Information gain (IG)
[Figure: Information Gain (y-axis, 0 to 0.4) plotted over terms 1 to 9961 (x-axis); annotation: 300 features]
![Page 49: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/49.jpg)
Document Term Matrix
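$$
\begin{array}{c|ccc}
 & t_1 & \cdots & t_n \\
\hline
d_1 & a_{11} & \cdots & a_{1n} \\
\vdots & \vdots & & \vdots \\
d_m & a_{m1} & \cdots & a_{mn}
\end{array}
$$

Rows are documents; columns are terms.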
324 Documents (108 in each of Health, Shopping and Education)
300 terms as identified by the Information Gain measure
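A document-term matrix of this shape can be built directly from tokenized documents. In this sketch the function name `document_term_matrix` is hypothetical, and each entry is assumed to be the number of times the term occurs in the document; the slides do not say whether the entries were counts, binary indicators, or weights.

```python
def document_term_matrix(documents, terms):
    """Return one row per document and one column per selected term.

    documents: list of token lists (already cleaned and stemmed)
    terms:     list of selected terms, e.g. the top terms by information gain
    """
    return [[doc.count(term) for term in terms] for doc in documents]
```

With 324 documents and 300 selected terms, the result would be a 324 by 300 matrix.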
![Page 50: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/50.jpg)
Principal Component Analysis
• Identifies patterns in data and expresses the data in such a way as to highlight their similarities and differences
• Once these patterns have been found in the data, we can reduce the number of dimensions without much loss of information
![Page 51: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/51.jpg)
PCA
• Calculate covariance matrix of original data
• Calculate eigenvalues and eigenvectors of covariance matrix
• The eigenvector with the largest eigenvalue identifies the principal component
• The principal component is the eigenvector that expresses the most significant relationship among the data dimensions
![Page 52: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/52.jpg)
Principal Component Eigenvalues
First 3 eigenvectors carry most of the information
![Page 53: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/53.jpg)
Matrix Projection
• After determining which components or eigenvectors to use, project the original document-term matrix into this new space
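The PCA steps and the projection can be sketched in plain Python. This is an illustration only: it finds just the first principal component, it uses power iteration in place of a full eigendecomposition (a swapped-in technique chosen to avoid external libraries), and the function names, starting vector, and iteration count are assumptions. A real experiment on a 324 by 300 matrix would more likely use a numerical library.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centered rows) for a data matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def principal_component(cov, iterations=100):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    d = len(cov)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def project(centered, component):
    """Project each mean-centered row onto one component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]
```

For a toy matrix whose second column is exactly twice the first, the principal component points along the direction (1, 2) and carries all of the variance, so projecting onto it reduces two dimensions to one without losing information.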
![Page 54: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/54.jpg)
![Page 55: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/55.jpg)
![Page 56: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/56.jpg)
Decision Tree
• Flow-chart-like tree structure
• Each internal node denotes a test on an attribute
• Each branch represents an outcome of the test
• Leaf nodes represent classes or class distributions.
• Used for classification
![Page 57: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/57.jpg)
Decision Tree
• The tree’s generation process can be seen as the generation of rules.
• First, build a tree from a known training data set.
• Then, use this tree to classify new data. A decision tree makes the rules in the data visible and easy to understand.
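The build-then-classify process described above can be sketched as a small tree learner. This is a generic illustration, not the algorithm used in the experiment (the slides do not name one): it splits numeric features at midpoints between observed values and picks the split with the highest entropy reduction. All function names and the toy data in the usage note are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature index, threshold) of the best split, or None."""
    base, best = entropy(labels), None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[f] <= thr]
            right = [l for r, l in zip(rows, labels) if r[f] > thr]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            if best is None or base - remainder > best[0]:
                best = (base - remainder, f, thr)
    return best


def build_tree(rows, labels):
    """Internal nodes are (feature, threshold, left, right); leaves are labels."""
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]
    _, f, thr = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= thr]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > thr]
    return (f, thr,
            build_tree([r for r, _ in left], [l for _, l in left]),
            build_tree([r for r, _ in right], [l for _, l in right]))


def classify(tree, row):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree
```

Each path from the root to a leaf reads as a rule, for example "if feature 0 is at most some threshold and feature 1 is above another, then Shopping", which is what makes the model easy to inspect.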
![Page 58: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/58.jpg)
Decision Tree
![Page 59: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/59.jpg)
Confusion Matrix (rows: original categories; columns: target categories)

| | Health | Shopping | Education | NSHE |
|---|---|---|---|---|
| Health | 10.0 | 0.8 | 0.5 | 0.7 |
| Shopping | 0.8 | 9.9 | 0.1 | 1.2 |
| Education | 0.9 | 0.1 | 9.2 | 1.8 |
| NSHE | 0.6 | 1.1 | 1.6 | 3.7 |
![Page 60: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/60.jpg)
Precision and Recall
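| | Health | Shopping | Education | NSHE | Recall |
|---|---|---|---|---|---|
| Health | 10.0 | 0.8 | 0.5 | 0.7 | 0.83 |
| Shopping | 0.8 | 9.9 | 0.1 | 1.2 | 0.83 |
| Education | 0.9 | 0.1 | 9.2 | 1.8 | 0.77 |
| NSHE | 0.6 | 1.1 | 1.6 | 3.7 | 0.53 |
| Precision | 0.81 | 0.83 | 0.81 | 0.50 | |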
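The precision and recall figures can be recomputed from the confusion matrix. The sketch below assumes that rows are the original categories and columns are the categories assigned by the classifier, which is how the slide labels them; the function name `per_class_scores` is invented for this example.

```python
LABELS = ["Health", "Shopping", "Education", "NSHE"]

# Confusion matrix values as shown on the slide.
CONFUSION = [
    [10.0, 0.8, 0.5, 0.7],  # original Health
    [0.8, 9.9, 0.1, 1.2],   # original Shopping
    [0.9, 0.1, 9.2, 1.8],   # original Education
    [0.6, 1.1, 1.6, 3.7],   # original NSHE
]


def per_class_scores(matrix):
    """Return (precision list, recall list), one value per category."""
    n = len(matrix)
    # Recall: diagonal entry divided by its row total.
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # Precision: diagonal entry divided by its column total.
    precision = [matrix[i][i] / sum(matrix[r][i] for r in range(n))
                 for i in range(n)]
    return precision, recall


precision, recall = per_class_scores(CONFUSION)
for label, p, r in zip(LABELS, precision, recall):
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Under that assumption the recomputed values agree with the slide's figures to within rounding. The row totals (12, 12, 12 and 7) are also consistent with one 10% test fold of 120 pages per task category and 70 NSHE pages.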
![Page 61: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/61.jpg)
Conclusion and Future Work
• As a filter, this approach would identify 80% of pages in Health, Shopping or Education
• Evaluate other classifiers
• System has to be scaled up:
  – More tasks, such as entertainment and sports
  – Larger data set with more noise
• Add form and functionality features to determine if there are recognizable genres of tasks
![Page 62: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/62.jpg)
How do I see these filters working?
[Diagram: Query → Search Engine → Search Results → Filter By Task (input: Task) → Filter By Genre (input: Genre) → Filtered Results]
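One way to picture the two filters is as a post-processing step over search results. This is a hypothetical sketch: it assumes each result already carries a predicted `task` and `genre` label (for example from classifiers like those described earlier), and the function name `filter_results` and the dictionary keys are invented for illustration.

```python
def filter_results(results, task=None, genre=None):
    """Keep results matching the requested task and genre.

    results: list of dicts with hypothetical keys "url", "task" and "genre"
    task, genre: labels to keep; None means "do not filter on this"
    """
    return [
        r for r in results
        if (task is None or r["task"] == task)
        and (genre is None or r["genre"] == genre)
    ]
```

Applying the task filter and then the genre filter in sequence gives the same result as the single combined pass shown here.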
![Page 63: Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University](https://reader036.vdocument.in/reader036/viewer/2022062407/56649e145503460f94afe6b7/html5/thumbnails/63.jpg)
Thank You
Web Information Filtering Lab
http://www.cs.dal.ca/wifl/