WEB CRAWLERS Presented At: Indies Services


DESCRIPTION

Understanding the basics of how a web crawler works.

TRANSCRIPT

Page 1: Web Crawlers

WEB CRAWLERS

Presented At: Indies Services

Page 2: Web Crawlers

Contents

What is a web crawler?
How does it work?
Why use it?
Challenges faced
Coding crawlers
Possible uses for us

Page 3: Web Crawlers

What are crawlers?

A web crawler is a computer program, also known as an ant, automatic indexer, bot, web spider, web robot, or web scutter.
It searches the web for web pages and the links on those pages.
It can perform any type of automated search or listing.
A crawler identifies itself through the User-Agent header of its HTTP requests.
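As a small illustration (not part of the original slides), a crawler written in Python can announce itself through the User-Agent header of each request; the bot name and info URL below are made-up placeholders.

```python
import urllib.request

# A crawler identifies itself via the User-Agent header of every HTTP request.
# The bot name and contact URL are placeholders for illustration only.
USER_AGENT = "ExampleBot/1.0 (+http://www.example.com/bot-info)"

def fetch(url):
    """Download a page while announcing who the crawler is."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

# Example: html = fetch("http://www.example.com/")
```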

Page 4: Web Crawlers

How it works

Page 5: Web Crawlers

Basic algorithm for a crawler:

1. Remove a URL from the unvisited URL list.
2. Determine the IP address of its host name.
3. Download the corresponding document.
4. Extract any links contained in it.
5. If a link's URL is new, add it to the list of unvisited URLs.
6. Process the downloaded document.
7. Go back to step 1.
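A minimal Python sketch of these seven steps follows. It is not the deck's own code: the seed URL is hypothetical, link extraction uses a simple regex for brevity, and the processing step is only a placeholder.

```python
import re
import socket
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed_url, max_pages=50):
    """Minimal crawling loop following the seven steps above."""
    unvisited = deque([seed_url])          # list of unvisited URLs
    visited = set()
    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()          # 1. remove a URL from the list
        if url in visited:
            continue
        host = urllib.parse.urlparse(url).hostname
        try:
            socket.gethostbyname(host)     # 2. resolve the host's IP address
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")  # 3. download
        except Exception:
            continue
        visited.add(url)
        # 4. extract links (a rough regex; a real crawler would use an HTML parser)
        for link in re.findall(r'href=["\'](.*?)["\']', html):
            absolute = urllib.parse.urljoin(url, link)
            if absolute not in visited:    # 5. add new URLs to the unvisited list
                unvisited.append(absolute)
        process(url, html)                 # 6. process the downloaded document
        # 7. the while loop returns to step 1

def process(url, html):
    """Placeholder processing step, e.g. indexing the page."""
    print(url, len(html), "bytes")
```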

Page 6: Web Crawlers

The Process

[Flowchart of the crawling loop: initialize the URL list with starting URLs (seeds); while the list is not empty, pick a URL, parse the page, and add any new URLs to the URL list; stop when no URLs remain.]

Page 7: Web Crawlers

Uses of crawlers

Search engines: list URLs and keep page information up to date.
Maintain the web graph (which pages link to which).
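One simple way (not shown in the slides) to keep the web graph a crawler discovers is an adjacency list mapping each page to the pages it links to; the function name here is hypothetical.

```python
from collections import defaultdict

# Each crawled page is a node; every hyperlink is a directed edge.
web_graph = defaultdict(set)

def record_links(page_url, extracted_links):
    """Store the outgoing links of one crawled page in the web graph."""
    for link in extracted_links:
        web_graph[page_url].add(link)

# Example:
# record_links("http://www.example.com/", ["http://www.example.com/about"])
# print(web_graph["http://www.example.com/"])
```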

Page 8: Web Crawlers

Uses of crawlers

Automated maintenance tasks:
Checking for broken internal links.
Validating HTML code.

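A hedged sketch of the broken-link check: assuming the internal links have already been extracted, each one can be probed with a HEAD request and anything answering with a 4xx/5xx status (or not answering at all) reported.

```python
import urllib.error
import urllib.request

def check_link(url):
    """Return the HTTP status of a URL, or None if the request fails entirely."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code            # e.g. 404 for a broken link
    except urllib.error.URLError:
        return None                # DNS failure, unreachable host, etc.

# Example: report links that do not answer with a 2xx status.
# for link in internal_links:
#     status = check_link(link)
#     if status is None or status >= 400:
#         print("broken:", link, status)
```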

Page 9: Web Crawlers

Uses of crawlers

Linguistics: textual search (e.g. which words are common today).

Market research: determining trends.

Getting certain types of information from the web: email addresses (often for spamming), images (for specialized image searches), and meta-tag information.
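As an illustrative sketch (not from the slides), meta-tag information and email addresses can be pulled out of a downloaded page with the standard-library HTML parser and a rough regex.

```python
import re
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the name/content pairs of <meta> tags in a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

def extract_emails(html):
    """Very rough email pattern; real extractors need more care."""
    return set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html))

# Example:
# parser = MetaTagParser(); parser.feed(html)
# print(parser.meta.get("description"), extract_emails(html))
```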

Page 10: Web Crawlers

Challenges faced

Which pages should it download? The web is too large to crawl entirely, so downloads have to be prioritized.

How to determine useful and unique links? URLs with GET parameters (internal links) can expose the same page under many addresses, which is why URL normalization is needed.
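A small sketch of URL normalization: the rules shown (lower-case scheme and host, drop default ports, drop fragments) are common ones, not an exhaustive standard.

```python
import urllib.parse

def normalize(url):
    """Normalize a URL so that trivially different forms compare equal."""
    parts = urllib.parse.urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only if it is not the default for the scheme.
    if parts.port and not (scheme == "http" and parts.port == 80) \
            and not (scheme == "https" and parts.port == 443):
        host = "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    # Fragments (#section) never reach the server, so they are discarded.
    return urllib.parse.urlunsplit((scheme, host, path, parts.query, ""))

# Example: normalize("HTTP://Example.COM:80/index.html#top")
#          -> "http://example.com/index.html"
```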

Page 11: Web Crawlers

Challenges …

Crawling policies:
Selection policy (download the most relevant pages).
Re-visit policy (when to check for changes in a page).
Politeness policy (the robots exclusion / robots.txt protocol).
Parallelization policy (how to coordinate the list of new URLs).
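As a sketch of the politeness policy, the standard-library robot parser can be consulted before fetching a page; the bot name is a placeholder.

```python
import urllib.parse
import urllib.robotparser

def allowed_to_fetch(url, user_agent="ExampleBot"):
    """Check the site's robots.txt before downloading a page (politeness policy)."""
    parts = urllib.parse.urlsplit(url)
    robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
    parser = urllib.robotparser.RobotFileParser(robots_url)
    parser.read()                       # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

# Example:
# if allowed_to_fetch("http://www.example.com/private/page.html"):
#     fetch the page; otherwise skip it politely.
```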

Page 12: Web Crawlers

Coding Crawlers

Common languages: PHP, Python, Perl, Java, etc., or any other server-side scripting language.

Logic used:
Get the URLs.
Search for unique URLs in the list.
Download the page, or get information from any particular page.
Process that information.
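A short Python sketch of that logic (the deck also lists PHP, Perl, and Java): keep only unique URLs, then download one particular page and pull a single piece of information out of it. The regex-based title extraction is for brevity only.

```python
import re
import urllib.request

def unique_urls(urls):
    """Keep only unique URLs while preserving their original order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result

def page_title(url):
    """Download one particular page and extract a single piece of information."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```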

Page 13: Web Crawlers

Possible uses for us

To maintain coding standards: check that a page contains proper code.

To get rid of unwanted or deprecated data: images or files that are no longer used.

To provide a customized search within any particular site.
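For the second use above, one possible approach (paths and names are hypothetical) is to compare the image files on disk with the image URLs collected while crawling the site; anything never referenced is a candidate for removal.

```python
import os

def unused_images(image_dir, referenced_images):
    """Return image files on disk that no crawled page references."""
    on_disk = set(os.listdir(image_dir))
    referenced = {os.path.basename(url) for url in referenced_images}
    return sorted(on_disk - referenced)

# Example (placeholder paths):
# print(unused_images("/var/www/site/images", crawled_image_urls))
```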

Page 14: Web Crawlers

Thanks

http://www.indies.co.in
http://www.indieswebs.com