
Automatic Classification of Bookmarked Web Pages

Chris Staff

First Talk

February 2007


Overview

• General Principles
• Reading List
• Tasks involved
• Schedule


General Principles

• Email: [email protected]
• Web site: http://www.cs.um.edu.mt/~cstaff
• Plagiarism
• Referencing
• ACM Digital Library: Membership for students from Malta


Reading List

– Abrams, D., Baecker, R.: How people use WWW bookmarks. In: CHI ’97: CHI ’97 extended abstracts on Human factors in computing systems, New York, NY, USA, ACM Press (1997) 341-342

– Bugeja, I.: Managing WWW browser’s bookmarks and history (a Firefox extension). Final year project report, Department of Computer Science & AI, University of Malta, 2006. http://hyper.iannet.org/hyperBkreport.pdf

– Cockburn, A., McKenzie, B.: What do web users do? An empirical analysis of web use. In: Int. J. Hum.-Comput. Stud. 54(6) (2001) 903-922

– Staff, C.: Automatic Classification of Web Pages into Bookmark Categories. Submitted to UM’07, 2007.

– Staff, C.: CSA3200 User Adaptive Systems Lecture Notes, 2006. Follow link from http://www.cs.um.edu.mt/~cstaff/

– Mozilla Development Center: “Building an Extension”, 2006. http://developer.mozilla.org/en/docs/Building_an_Extension


Classifying Bookmarks

• When a user bookmarks a page (or adds a page to Favorites) we want to recommend the best existing category
  – Improvement over simply recommending last category saved to
  – Improvement over simply offering ‘category root’


Tasks

1. Representation of bookmark categories
2. Two clustering/similarity algorithms
3. Extra utility
4. User interface
5. Evaluation
6. Write up report


Tasks Overview

• We are going to implement a number of algorithms to help with the overall task.
  – Some of these will be used while the user is browsing
  – Others will be used to classify pages ‘off-line’ (especially for the existing bookmark files)

• We’re going to have a ‘standard test bed’ for conducting the evaluation


Tasks Overview

• Represent bookmark categories
  – We’re starting with populated bookmark files, so use ‘How Did I Find That?’ approach
  – Plus another, individual approach
• When a page is to be bookmarked
  – If referrer page is available, identify topic of page
  – Otherwise, identify page topic using ‘How Did I Find That?’ approach
• Compare the current page’s topic to the bookmark category representations


Tasks Overview

• User Interface
  – To replace the built-in ‘Bookmark this Page’ menu item and keyboard command
  – To display a new dialog box to users to offer choice of recommended category, last category used, and to allow user to select some other category or create a new category


Tasks Overview

• Evaluation
  – Will be standard and automated
  – For testing purposes, download test_eval.zip from home page
    • Contains 2x8 bookmark files (.html) and one URL file (.txt)
    • Bookmark files are ‘real’ files collected one year ago
    • URL file contains a number of lines with following format:
      – Bk file ID, URL of bookmarked page, home category, exact entry from bookmark file (with date created, etc.)
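The line format above can be read with a small parser. This is only a sketch under stated assumptions: the slides do not say which delimiter the URL file uses, so a comma is assumed, and only the first three commas are treated as separators because the bookmark entry itself may contain commas. The function name `parseUrlLine`, the field names, and the sample line are all invented for illustration and do not come from test_eval.zip.

```javascript
// Split one URL-file line into its four fields.
// ASSUMPTION: fields are comma-separated and the first three fields
// (file ID, URL, home category) contain no commas themselves.
function parseUrlLine(line) {
  const fields = [];
  let rest = line;
  for (let i = 0; i < 3; i++) {
    const comma = rest.indexOf(",");
    if (comma === -1) {
      throw new Error("Expected 4 comma-separated fields: " + line);
    }
    fields.push(rest.slice(0, comma).trim());
    rest = rest.slice(comma + 1);
  }
  return {
    fileId: fields[0],
    url: fields[1],
    homeCategory: fields[2],
    entry: rest.trim(), // exact bookmark-file entry, commas preserved
  };
}

// Hypothetical sample line (not taken from the real test data).
const sample =
  '3, http://www.example.org/, Research/IR, ' +
  '<DT><A HREF="http://www.example.org/" ADD_DATE="1139000000">Example, a page</A>';
const record = parseUrlLine(sample);
```

If the real file turns out to use tabs or quoted fields, only the delimiter handling inside `parseUrlLine` would need to change.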


Tasks Overview

• Evaluation (continued)
  – Challenge to also ‘re-create’ bookmark file in the order that it was created by users
  – Eventually, close to the end of the APT, the evaluation test data sets will be made available
    • About 20 unseen bookmark files and one URL file
  – Same format as before
  – You’ll get bookmark files early to prepare representations, but classification run will be part of a demo session


Tasks Overview

• Write up report
  – We’ll spend some time looking at the structure of a scientific report, how to write a literature review, present evaluation results, etc.


Task: Representing Bookmark Categories

• We need to identify what a category or collection of bookmarks is about so that we can check if a new page could belong to that category

• Ideally, we find out what is similar between the different documents in the category (especially if we know which link a user followed to reach child!)

• In the absence of this information, use:
  – One algorithm based on ‘How Did I Find That?’
  – A second algorithm that is up to you
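One possible starting point for the second, individual algorithm is a centroid of term frequencies over the pages already in a category. The sketch below is an assumption-laden illustration, not the ‘How Did I Find That?’ method (which these slides do not describe): `tokenize`, `termFrequencies`, and `categoryCentroid` are hypothetical names, the tokenizer only keeps ASCII letters and digits, and the page texts are made up.

```javascript
// Hypothetical helper: lower-case the text and split it into word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Term-frequency vector of one page, as a Map from term to count.
function termFrequencies(text) {
  const tf = new Map();
  for (const term of tokenize(text)) {
    tf.set(term, (tf.get(term) || 0) + 1);
  }
  return tf;
}

// Category representation: the average (centroid) of the
// term-frequency vectors of the pages bookmarked in that category.
function categoryCentroid(pageTexts) {
  const centroid = new Map();
  for (const text of pageTexts) {
    for (const [term, count] of termFrequencies(text)) {
      centroid.set(term, (centroid.get(term) || 0) + count / pageTexts.length);
    }
  }
  return centroid;
}

// Made-up page texts standing in for the pages of one category.
const cooking = categoryCentroid([
  "Pasta recipes and pasta sauces",
  "Quick recipes for dinner",
]);
```

Here `cooking.get("pasta")` is 1 (two occurrences averaged over two pages) and `cooking.get("dinner")` is 0.5. Stop-word removal and term weighting are deliberately left out.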


Task: Two clustering/similarity algorithms

• Once we have represented the categories, we can ‘send’ page to be bookmarked to best category
  – Similar to ‘information filtering’ or ‘clustering’
  – What similarity measure or clustering algorithm to use?
    • One way of representing page to be classified will be based on ‘How Did I Find That?’
    • Other way researched/developed by you
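As one candidate answer to the similarity-measure question, cosine similarity between sparse term vectors is a common choice in information filtering. The sketch below assumes that pages and categories are already represented as Maps from term to weight; the names `cosine` and `bestCategory` and the sample vectors are hypothetical, and nothing here is prescribed by the slides.

```javascript
// Cosine similarity between two sparse term vectors (Map: term -> weight).
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, weight] of a) {
    normA += weight * weight;
    dot += weight * (b.get(term) || 0);
  }
  for (const weight of b.values()) {
    normB += weight * weight;
  }
  if (normA === 0 || normB === 0) return 0; // empty vector: no similarity
  return dot / Math.sqrt(normA * normB);
}

// Name of the category most similar to the page, or null if none overlaps.
// categories is a Map from category name to its term vector.
function bestCategory(pageVector, categories) {
  let best = null;
  let bestScore = 0;
  for (const [name, vector] of categories) {
    const score = cosine(pageVector, vector);
    if (score > bestScore) {
      best = name;
      bestScore = score;
    }
  }
  return best;
}

// Made-up category representations and page vector.
const categories = new Map([
  ["Cooking", new Map([["pasta", 2], ["recipes", 3], ["dinner", 1]])],
  ["Travel", new Map([["flights", 2], ["hotels", 2], ["malta", 1]])],
]);
const page = new Map([["malta", 1], ["hotels", 1], ["beaches", 1]]);
const recommended = bestCategory(page, categories);
```

With these vectors `recommended` is "Travel". Returning null when no category shares a term with the page leaves the caller free to fall back to the last category used.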


Task: Extra Utility

• How can the classification of web pages to be bookmarked be improved?
  – What particular interests do you have, and how can they be used to improve classification?
    • E.g., synonym detection, automatic reorganisation of bookmarks, …


Task: User Interface

• Can use XUL to ‘extend’ Mozilla Firefox
  – http://www.xulplanet.com/tutorials/xultu/
• Use Ian Bugeja’s HyperBK as a framework (with due referencing and acknowledgement, of course): https://addons.mozilla.org/firefox/2539/
• Programs are likely to be JavaScript
• Your extension will then be portable


Task: User Interface

• You can use Ian’s interface, but it may need some work to tweak it:
  – To support some of the new functionality that you’re adding (e.g. choice of algorithms)
  – And to fix some of the usability problems with the dialog box


Task: Evaluation

• ACofBWP will be evaluated!
• But you must build a version of the program that:
  – can be called in batch mode
  – will accept a directory containing bookmark files and a URL file
  – will run in two modes (classify and reconstruct)
  – will report faithfully on its performance


Task: Write Up Report

• At least one tutorial will be dedicated to good report writing practice; how to write a literature review; how to build and write references; how to present evaluation results.


Grading Structure

• 10% for obtaining an average of at least 0.8 precision on evaluation (for random bookmark classification, using either implemented approach)

• 10% for incurring a maximum 2 second overhead on average to classify a page (must faithfully report time overhead)

• Max. 10% for extra utility

• 40% Report

• 15% Presentation

• 15% Artifact Design/Implementation
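The two measurable thresholds above (average precision of at least 0.8 and an average overhead of at most 2 seconds per page) could be reported by a batch run along the following lines. This is a sketch under assumptions: precision is taken here to mean the fraction of pages assigned to their home category, which the slides do not define, and `timed`, `summarise`, and the result field names are hypothetical.

```javascript
// Time one classification call. classify is a caller-supplied function
// that maps a page to a category name.
function timed(classify, page) {
  const start = Date.now();
  const predicted = classify(page);
  return { predicted, elapsedMs: Date.now() - start };
}

// Summarise a run. Each result has the predicted category, the expected
// (home) category, and the time taken in milliseconds.
// ASSUMPTION: precision = correctly classified pages / all classified pages.
function summarise(results) {
  if (results.length === 0) {
    return { precision: 0, meanOverheadMs: 0 };
  }
  let correct = 0;
  let totalMs = 0;
  for (const r of results) {
    if (r.predicted === r.expected) correct++;
    totalMs += r.elapsedMs;
  }
  return {
    precision: correct / results.length,
    meanOverheadMs: totalMs / results.length,
  };
}

// Made-up results for five pages: four correct, one wrong.
const summary = summarise([
  { predicted: "Travel", expected: "Travel", elapsedMs: 100 },
  { predicted: "Cooking", expected: "Cooking", elapsedMs: 200 },
  { predicted: "Travel", expected: "News", elapsedMs: 300 },
  { predicted: "News", expected: "News", elapsedMs: 400 },
  { predicted: "Cooking", expected: "Cooking", elapsedMs: 500 },
]);
const meetsPrecision = summary.precision >= 0.8;
const meetsOverhead = summary.meanOverheadMs <= 2000;
```

For the made-up run, precision is 0.8 and the mean overhead is 300 ms, so both flags are true.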


Future Opportunities

• FYP supervision
• Opportunity to co-author research paper that will be submitted to leading IR/AH/UM conference (irrespective of FYP)


Pitfalls

• Utilities must be lightweight
  – Mostly those that are interactive, or that are invoked while user is browsing

• Should all of a document be used to contribute to a category representation/be used in a similarity measure?


Schedule

• Until w.c. 6th March inc: Discussion, talks once/week
• w.c. 19th March: Submit TOC/chapter overview for feedback (optional)
• w.c. 23rd Apr: Demo 1 (optional)
• 23rd Apr-7th May: Submit one chapter of your choice for feedback (optional)
• w.c. 7th May: Demo 2 (optional)
• 14th May: Evaluation collection will be made available
• May 25: Submit APT report
• June: Demo and evaluation under exam conditions