AfterCollege Self-Service Scrape Configuration & Posting Utility

Kai Hu

Haiyan Wu

May 14, 2009 @ Harney 235

Presentation Outline

- Background & Motivation
- Goals
- Design
- Challenges
- Implementation Details
- Project Demonstration
- Future Extensions

04/20/23 AfterCollege Scrape Utility

AfterCollege Background

Customized career network for colleges & professional organizations across the country

Goal: Create a better way for job-seeking students and alumni to connect with the right employer

What's Already There?

- Manually created configuration files
- Crawler that runs periodically
- Job feed outputs to be posted online

Diagram: Staff -> config.xml -> Black Widow crawler -> jobFeed.xml -> AfterCollege website

Limitations

- Limited scalability
- Expensive to maintain
- Requires technical knowledge
- Supports only GET requests
- Unable to handle dynamic websites

Design Overview

- GUI tool that guides staff through the configuration process
- Web proxy that captures user activity
- New crawler that uses both DOM and string pattern matching

Diagram labels: GUI Tool, Web Proxy, New Crawler, JSON files, config file, job feed (jobFeed.xml)


Advantages

- Eases the pain: non-technical staff can also configure the system
- Makes the configuration process more straightforward and easier to understand
- Less expensive to maintain: reconfiguring takes less than 10 minutes
- Supports POST
- Can be extended to support more complicated websites
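
As an illustration of what POST support involves, the sketch below builds a form-encoded POST request of the kind a crawler would replay after the proxy captured a form submission. It is only a sketch under stated assumptions: it uses the JDK's java.net.http API (Java 11+) in place of the Apache HttpClient named in the slides, it builds the request without sending it, and the class name, URL, and form field names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; not part of the original project.
final class PostReplaySketch {

    /** Encodes form fields as application/x-www-form-urlencoded text. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) a POST request carrying the captured form fields. */
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }
}
```

Sending the request would be a separate step, for example with java.net.http.HttpClient; it is left out here so the sketch stays independent of any real site.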

Challenges

- Come up with an easy-to-follow user interface
- Build a web proxy from scratch
- Distinguish patterns based on selected text
- Develop a crawler algorithm that handles job information residing on different pages
- Deal with tricky JavaScript
- Deal with embedded HTML pages
- Test crawling accuracy

Design Decisions

Firefox Plugin vs. Web Proxy
- Integration with the back end
- Ability to add functionality

Dojo vs. YUI
- Fade-in/out, drag & drop
- Handling of different browsers
- Documentation

XML vs. JSON
- Simplicity and efficiency of parsing
- Availability of wrapper methods in YUI

Implementation Details

GUI Tool
- JavaScript inserted into each page
- YUI as the JavaScript toolkit for the user interface
- AJAX for communication with the web proxy

Web Proxy
- Java Servlet
- Jetty as web/app server
- Apache HttpClient

Crawler
- Regular expressions for pattern matching
- Scrapes jobs on a per-page, per-field basis
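
The per-page, per-field scraping idea can be sketched roughly as follows. The slides do not show the crawler's actual configuration format, so everything here is an assumption for illustration: the class and method names are hypothetical, the configuration is modeled as a map from page name to a map of field name to regular expression, and capture group 1 of each expression is taken as the field value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real crawler's data model is not shown in the slides.
final class FieldScraperSketch {

    /** Applies one regex per field to a single page; group 1 is the field value. */
    static Map<String, String> scrapePage(String html, Map<String, Pattern> fieldPatterns) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : fieldPatterns.entrySet()) {
            Matcher m = e.getValue().matcher(html);
            if (m.find()) {
                fields.put(e.getKey(), m.group(1).trim());
            }
        }
        return fields;
    }

    /**
     * Merges fields scraped from several pages into one job record.
     * pages:  page name -> HTML source
     * config: page name -> (field name -> pattern)
     */
    static Map<String, String> scrapeJob(Map<String, String> pages,
                                         Map<String, Map<String, Pattern>> config) {
        Map<String, String> job = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Pattern>> e : config.entrySet()) {
            String html = pages.get(e.getKey());
            if (html != null) {
                job.putAll(scrapePage(html, e.getValue()));
            }
        }
        return job;
    }
}
```

Keeping the patterns grouped by page lets one job record draw its title from a listing page and its location from a detail page, which matches the challenge of job information residing on different pages.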

Add customized JavaScript to rendered HTML pages
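
A minimal sketch of this injection step, assuming the proxy rewrites the HTML text before returning it to the browser. The slide does not say where the script is inserted; placing a script tag just before the closing head tag is an assumption, and the class name and script URL are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the proxy's page-rewriting step.
final class ScriptInjectorSketch {

    // Matches the closing head tag regardless of letter case.
    private static final Pattern HEAD_CLOSE =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the page with a script tag inserted just before the closing head tag. */
    static String inject(String html, String scriptUrl) {
        String tag = "<script type=\"text/javascript\" src=\"" + scriptUrl + "\"></script>";
        Matcher m = HEAD_CLOSE.matcher(html);
        if (m.find()) {
            return html.substring(0, m.start()) + tag + html.substring(m.start());
        }
        // No closing head tag found: fall back to appending the tag at the end.
        return html + tag;
    }
}
```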

Rendered HTML source code

Output content

DOM Pattern
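
The slide gives only the title, so the following is a stand-in rather than the project's method: it illustrates DOM-based matching by treating a pattern as an XPath expression and evaluating it with the JDK's built-in XML parser. That parser requires well-formed markup, so real-world HTML would first need a tolerant HTML parser; the class name and sample path are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: XPath stands in for whatever DOM pattern format the crawler used.
final class DomPatternSketch {

    /** Returns the trimmed text of every node selected by the XPath expression. */
    static List<String> extract(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Malformed markup or an invalid expression ends up here.
            throw new IllegalArgumentException("could not apply DOM pattern", e);
        }
    }
}
```

For example, the path /html/body/table/tr/td[1] selects the first cell of every table row, which is how a single selected job title could generalize to every job in a listing.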

String Pattern
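
One way to picture string pattern matching, given that the slide itself shows only the title: take the text a user selected, keep a few characters of the page source on either side of it, and turn that context into a regular expression that captures whatever appears between the same delimiters on other pages. This is an assumed approach for illustration; the class name, method names, and context length are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of deriving a string pattern from a user's text selection.
final class StringPatternSketch {

    /**
     * Builds a regex from up to `context` characters before and after the
     * first occurrence of the selected text in the page source.
     */
    static Pattern derive(String html, String selected, int context) {
        int start = html.indexOf(selected);
        if (start < 0) {
            throw new IllegalArgumentException("selected text not found in page");
        }
        int end = start + selected.length();
        String prefix = html.substring(Math.max(0, start - context), start);
        String suffix = html.substring(end, Math.min(html.length(), end + context));
        // Quote the context literally; capture lazily between the two delimiters.
        return Pattern.compile(Pattern.quote(prefix) + "(.*?)" + Pattern.quote(suffix),
                Pattern.DOTALL);
    }

    /** Returns the captured value from another page, or null if the pattern does not match. */
    static String apply(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

A short context makes the pattern more general but more likely to match the wrong spot; a long context does the opposite. If the selection sits at the very end of the page, the suffix is empty and the lazy capture matches an empty string, so a practical version would need to guard against that.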

Project Demonstration

Future Extensions

Pagination
- Add support for crawling multiple pages

Tricky JavaScript
- Find a solution to prevent redirection to a different domain

Embedded Pages
- Add functionality to get the HTML content of embedded pages

Resources

Course Instructor
- Dr. Jeff Buckwalter

Sponsor
- Steve Girolami, Perry Lee, & Saan Saeteurn

Source Code Control System
- Subversion

Wiki Site
- Knowledge share, work log, resource portal
- http://cs690.wikispaces.com/

Google Group
- Discussion and information exchange medium
- http://groups.google.com/group/desidae

Questions?

Thank you!
