AFTERCOLLEGE SELF-SERVICE SCRAPE CONFIGURATION AND POSTING UTILITY
Kai Hu, Haiyan Wu
March 17, 2009 @ Cowell 416
Midterm Presentation



PRESENTATION OUTLINE

Background and Motivation
Goals
Design
Challenges
Timeline and Milestones
Current Progress


AFTERCOLLEGE BACKGROUND

Customized career network for colleges and professional organizations across the country

Goal: create a better way for job-seeking students and alumni to connect with the right employers


WHAT’S ALREADY THERE?

AfterCollege staff manually create configuration files

A simple crawler runs periodically

The crawler's output is posted on AfterCollege’s website


LIMITATIONS

Scalability
Unable to handle POST requests
Unable to handle dynamic websites
Expensive to maintain
Requires technical knowledge


DESIGN OVERVIEW

A new GUI tool guides staff through the configuration process

A web proxy captures user activities

The crawler applies pattern matching based on the new configuration file


GOALS: GUI TOOL

Guide users through the configuration process
Deal with dynamic websites


GOALS: WEB PROXY

Capture user activities
Generate configuration files
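
One way to picture the proxy's role: each captured HTTP exchange is reduced to the request details a crawler would need to replay it, then serialized into a configuration file. The sketch below is only an illustration. The `CapturedRequest` record and the element names (`scrape-config`, `request`, `param`) are hypothetical; the slides do not show the actual configuration schema.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class CapturedRequest:
    """Hypothetical record of one HTTP request seen by the proxy."""
    method: str
    url: str
    params: dict = field(default_factory=dict)


def build_config(requests):
    """Serialize captured requests into an XML string (illustrative schema)."""
    root = ET.Element("scrape-config")
    for req in requests:
        node = ET.SubElement(root, "request", method=req.method, url=req.url)
        for name, value in req.params.items():
            ET.SubElement(node, "param", name=name).text = value
    return ET.tostring(root, encoding="unicode")


# Two example exchanges: a plain page load and a form submission.
captured = [
    CapturedRequest("GET", "http://example.com/jobs"),
    CapturedRequest("POST", "http://example.com/search", {"keyword": "intern"}),
]
config_xml = build_config(captured)
print(config_xml)
```

Recording the method and form parameters alongside the URL is what would let a crawler repeat a POST-based search rather than only fetching static links.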


GOALS: CRAWLER

Scrape job posts
Check result integrity


Crawl Job List page

Get Configuration file

Pattern-Match Application

Generate Job List Result
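
The four steps above, together with the integrity check, could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the config keys (`job_pattern`, `required_fields`), the sample HTML, and the choice of a regular expression for pattern matching are all invented here. The page is an inline string so the sketch runs without network access.

```python
import re

# Step 1 (stand-in): the crawled job-list page, inlined instead of fetched.
PAGE = """
<ul>
  <li><a href="/jobs/101">Software Engineer Intern</a> <span>San Francisco</span></li>
  <li><a href="/jobs/102">Data Analyst</a> <span>Oakland</span></li>
</ul>
"""

# Step 2 (stand-in): a configuration as the GUI tool and proxy might produce it.
CONFIG = {
    "job_pattern": (
        r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
        r"<span>(?P<location>[^<]+)</span>"
    ),
    "required_fields": ["url", "title", "location"],
}


def scrape(page, config):
    """Step 3: apply the configured pattern to the page."""
    pattern = re.compile(config["job_pattern"])
    return [match.groupdict() for match in pattern.finditer(page)]


def check_integrity(jobs, config):
    """Keep only records in which every required field is non-empty."""
    return [
        job
        for job in jobs
        if all((job.get(name) or "").strip() for name in config["required_fields"])
    ]


# Step 4: the resulting job list.
jobs = check_integrity(scrape(PAGE, CONFIG), CONFIG)
for job in jobs:
    print(job)
```

Keeping the pattern in the configuration rather than in the crawler code is what lets a new site be added without changing the crawler itself.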


DESIGN ISSUES

Firefox Plugin vs. Web Proxy
  Integration with back-end
  Ability to add functionality

Dojo vs. YUI
  Fade-In/Out, Drag & Drop
  Deals with different browsers

XML vs. JSON
  Simplicity & efficiency of parsing
  Availability of wrapper methods in YUI
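
To make the parsing trade-off concrete, here is the same hypothetical setting expressed both ways and read with Python's standard library. The field names and values are invented for illustration; the slides do not show a real schema.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up field definition in both formats.
XML_TEXT = '<field name="title"><pattern>h2.job-title</pattern></field>'
JSON_TEXT = '{"name": "title", "pattern": "h2.job-title"}'

# XML: parse into an element tree, then navigate attributes and children.
elem = ET.fromstring(XML_TEXT)
from_xml = {"name": elem.get("name"), "pattern": elem.findtext("pattern")}

# JSON: parsing yields native dicts and lists directly.
from_json = json.loads(JSON_TEXT)

print(from_xml)
print(from_json)
```

The JSON form maps straight onto native data structures, while the XML form needs an explicit decision about what is an attribute and what is a child element.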


DESIGN OVERVIEW

[Architecture diagram. Component labels: Browser, Rendered HTML page, Injected YUI JavaScript, Web Proxy, Apache, HTTP Client, Tomcat Web/App Server, HTML Parser, Job List Sites, Crawler, Loader/Scheduler, Parser, HTTP Client, Config.xml, JobFeed.xml, Feed Generator]
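
The diagram names a Feed Generator alongside JobFeed.xml. A minimal sketch of what such a step might look like, assuming hypothetical element names (`jobfeed`, `job`) and a list-of-dicts input; the actual feed schema is not given in the slides.

```python
import xml.etree.ElementTree as ET


def generate_feed(jobs):
    """Build an XML feed string from scraped job records (illustrative schema)."""
    root = ET.Element("jobfeed")
    for job in jobs:
        node = ET.SubElement(root, "job")
        for key, value in job.items():
            # ElementTree escapes characters such as "&" when serializing.
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


sample_jobs = [
    {"title": "R&D Engineer", "url": "http://example.com/jobs/101"},
    {"title": "Data Analyst", "url": "http://example.com/jobs/102"},
]
feed_xml = generate_feed(sample_jobs)
print(feed_xml)
```

Building the feed through an XML library rather than string concatenation keeps scraped text with special characters from producing a malformed feed.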


CHALLENGES

Runtime analysis of DOM objects for websites that use AJAX to generate DOM objects dynamically on the client side

Dealing with tricky JavaScript

Embedded HTML pages


MILESTONES

GUI Tool (March 20)
  Workflow support
  Capture job information

Web Proxy (March 20)
  Render HTML pages
  Capture HTTP communications

Web Crawler (April 13)
  Pattern-matching ability given a configuration file
  Integrity check

Integration Test (April 20)

Testing (April 27)


CURRENT FOCUS

Web Proxy
  Ability to deal with JavaScript
  Session/Cookie support

GUI Tool
  Embedded web pages
  Allow user modifications


CURRENT PROGRESS

Demo


RESOURCES

Course Instructor
  Dr. Jeff Buckwalter

Sponsor
  Steve Girolami, Perry Lee, & Saan Saeteurn

Source Code Control System
  Dasidae SVN from Perry

Wiki Site
  Knowledge share, work log, resource portal

Google Group
  Discussion and information exchange medium


Questions?


Thank You
