Bixo - a webcrawler toolkit

Ken Krugler, Stefan Groschupf

Tambako the Jaguar@flickr.com

Friday, May 22, 2009

Agenda

Overview

Background

Motivation

Goals

Status

Differences

Architecture

Data life cycle

Robust Testing

Resources

Overview

Primary users will be companies extracting data from the web (not search)

Interested in subset of the web

Typically part of larger data processing system

Motivation - tech

No good solution available

We need a toolkit

Missing from Nutch et al.

Easy to integrate

Easy to extend

Easy to understand

API vs CLI

Pluggable I/O

Avoid common problems

Spider traps & link farms

Slow servers

Hanging crawls

Motivation - EMI

Screen scrape, data extraction

Artist websites, e.g. concert dates

Many pages from large sites

Just crawl, no index

One of many inputs into Business Intelligence

Integration in larger BI system (Cascading-based)

Motivation - ShareThis

Focused index for key partners

Data analysis and mining of 100M pages

Integration into existing log analysis and data mining systems (Cascading-based)

Low IT/Ops support requirements

Goals

Fulfill key motivating requirements

OSS project with business-friendly license

Focus on vertical crawling, leverage other projects

Efficient execution in EC2/cloud environment

Grow OSS community

Current Status

We already do crawls in EC2

2 sponsored developers, since March 2009

MIT license

Todo:

Improve robots.txt handling

Bugfixes and many improvements

Website & documentation

A CLI for easy testing.

Differences (from Nutch)

Toolkit versus system - building blocks, not plugins

Workflow focus, versus system where you set conf and run a command

More emphasis on instrumentation - monitoring and error handling

No search serving

Vertical crawl, not intranet or whole web

HTTP(S) only, not ftp, etc.

Differences (from Hadoop)

Not much, which is a good thing

Generates lots of data - want to store in S3, want to minimize writes

Heavy user of DNS server - extra set up for caching server

Fetch phase is unusual Cascading topology

Hadoop Intro

Open Source map reduce system

Execution layer - map reduce

Mapper, Reducer Tasks

Storage layer - (distributed) file system

Local FS, HDFS, S3, etc

Scales from single node to thousands
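The map and reduce roles named on this slide can be illustrated with a toy, single-process sketch. This is not Hadoop's API: `map_url`, `reduce_counts`, and `run_job` are hypothetical names, and the "shuffle" is just an in-memory dictionary. It counts URLs per host, a typical first step when sizing a crawl.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_url(url):
    """Mapper (hypothetical): emit one (host, 1) pair per input URL."""
    host = urlsplit(url).hostname or ""
    yield host, 1


def reduce_counts(host, counts):
    """Reducer (hypothetical): collapse all values for one key into a total."""
    return host, sum(counts)


def run_job(urls):
    """Toy driver: map every record, group values by key, then reduce."""
    shuffled = defaultdict(list)
    for url in urls:
        for key, value in map_url(url):
            shuffled[key].append(value)
    return dict(reduce_counts(key, values) for key, values in shuffled.items())
```

On a real cluster the mapper and reducer would run as separate tasks on many nodes, with the framework handling the grouping step between them.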

Cascading Intro

Data processing can be hard with Hadoop

Cascading extends Hadoop

Provides simple data processing API

Reusable (unix) pipe based concept

Sources and Sinks separated

HDFS, HBase, JDBC, Aster etc.

Assemble Pipes, Source and Sink in a Flow

GPL or OEM, though might change
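The pipe idea can be sketched in plain Python as an analogy. This is not the Cascading API: `each`, `keep`, `assemble`, and `run_flow` are hypothetical helpers that only mimic the shape of assembling reusable pipes and then binding them to a source and a sink.

```python
def each(function):
    """Build a stage that applies `function` to every tuple passing through."""
    def stage(tuples):
        for item in tuples:
            yield function(item)
    return stage


def keep(predicate):
    """Build a stage that drops tuples for which `predicate` is false."""
    def stage(tuples):
        return (item for item in tuples if predicate(item))
    return stage


def assemble(*stages):
    """Chain stages left to right into one reusable pipe."""
    def pipe(tuples):
        for stage in stages:
            tuples = stage(tuples)
        return tuples
    return pipe


def run_flow(source, pipe, sink):
    """Bind a source (any iterable) and a sink (any callable) to a pipe."""
    for item in pipe(source):
        sink(item)
```

Because the pipe knows nothing about where tuples come from or go to, the same assembly can be reused with a list in a unit test and with a different source and sink in production, which is the separation the slide describes.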

Architecture

[Diagram: Hadoop, Cascading, and Bixo pipes as layers beneath your Java, your Groovy, or your Jython code; data flows from input to output; runs on a single JVM, a server, or a cluster]

Data life cycle

Inject URLs in URL DB

Select URLs from URL DB - based on recrawl policy, or partner/domain, or type, etc

Normalize URLs

Score URLs

Group URLs

Fetch

Save content

and/or update URL DB

and/or analyze/parse content

Notice nothing about indexing, pushing out index, serving up index.

Meta data fully supported
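The "select URLs from URL DB" step can be sketched as below. This is an illustration only: the record layout (`url`, `domain`, `last_fetched`) and the function name `select_due` are assumptions, not Bixo's actual schema or API, and the seven-day recrawl interval is an arbitrary default.

```python
from datetime import datetime, timedelta


def select_due(url_db, now, recrawl_after=timedelta(days=7), domain=None):
    """Return URLs that were never fetched or are due for a recrawl.

    `url_db` is an iterable of dicts with hypothetical keys
    'url', 'domain', and 'last_fetched' (a datetime or None).
    """
    due = []
    for record in url_db:
        # Optional restriction to a single partner/domain.
        if domain is not None and record["domain"] != domain:
            continue
        last = record.get("last_fetched")
        if last is None or now - last >= recrawl_after:
            due.append(record["url"])
    return due
```

The selected URLs would then go through the normalize, score, group, and fetch steps listed above.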

Architecture - Pipes

[Diagram: fetch pipe, parse pipe, update url db pipe, url pipe]

Import Url Pipe

[Diagram: URLs (Source) -> Import SubAssembly (Each: URL Normalizing) -> URL DB (Sink); pluggable interface: IUrlFilter]
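A minimal sketch of what URL normalizing and filtering during import might look like, using only the Python standard library. The rules chosen here (lowercase host, drop default ports and fragments, HTTP(S) only) are assumptions for illustration; `normalize_url`, `accept_url`, and `import_urls` are hypothetical names and do not reflect Bixo's IUrlFilter interface.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lowercase scheme and host, drop default port, fragment, and userinfo."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def accept_url(url):
    """Hypothetical filter: accept only HTTP(S) URLs that have a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def import_urls(raw_urls):
    """Normalize, filter, and de-duplicate URLs before they reach the URL DB."""
    seen = set()
    for raw in raw_urls:
        url = normalize_url(raw)
        if accept_url(url) and url not in seen:
            seen.add(url)
            yield url
```

Normalizing before de-duplicating matters: two spellings of the same address would otherwise be stored, and later fetched, twice.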

Fetch Pipe

[Diagram: URLs (Source) -> Fetch SubAssembly (Each: URL Domain Map; Each: URL Scoring; GroupBy: URL Grouping; Every: Fetching) -> Pages & Status (Sink); pluggable interfaces: GroupingKeyGenerator, IHttpFetcher, ScoreGenerator]
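The score, group, and fetch stages can be sketched as follows. Everything here is an assumption made for illustration: grouping by host, scoring by path depth, and the per-group limit are stand-ins, and `grouping_key`, `score`, and `fetch_groups` are hypothetical names rather than the GroupingKeyGenerator, ScoreGenerator, or IHttpFetcher interfaces. The `fetch` argument is any callable, so the sketch runs without network access.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def grouping_key(url):
    """Hypothetical grouping key: the URL's host."""
    return urlsplit(url).hostname or ""


def score(url):
    """Hypothetical score: shallower paths score higher."""
    depth = len([seg for seg in urlsplit(url).path.split("/") if seg])
    return 1.0 / (1 + depth)


def fetch_groups(urls, fetch, max_per_group=2):
    """Group URLs by key, then fetch the best-scored URLs of each group."""
    groups = defaultdict(list)
    for url in urls:
        groups[grouping_key(url)].append((score(url), url))
    results = {}
    for scored in groups.values():
        # Highest score first; ties broken by URL for a stable order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, url in scored[:max_per_group]:
            results[url] = fetch(url)
    return results
```

Grouping by host before fetching is what makes per-server politeness possible: all requests to one server pass through one place, where they can be limited or spaced out.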

Parse Pipe

[Diagram: Pages (Source) -> Parse SubAssembly (Each: URL Domain Map) -> ParsedText & OutLinks (Sink); pluggable interface: IParser]
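A parser that turns a fetched page into parsed text plus outlinks could look roughly like this. It is a sketch built on Python's standard `html.parser`, not Bixo's IParser; the class and function names are hypothetical, and real-world HTML would need more care (character encodings, `<base>` tags, nofollow links).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect visible text and absolute outlinks from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.outlinks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.outlinks.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (parsed_text, outlinks) for one fetched page."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.outlinks
```

The outlinks are what feed URL discovery: they go back through normalizing and filtering before being merged into the URL DB.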

Update Pipe

[Diagram: URLs (Source) -> Update DB SubAssembly (Each: URL Normalizing; GroupBy: URL Grouping; Every: URL Selection) -> URL DB (Sink); labels: IUrlFilter, LastUpdated]
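One plausible reading of the group-then-select step is "keep the most recently updated record per URL". The sketch below assumes that reading; the record keys (`url`, `last_updated`, `status`) and the function name `update_url_db` are hypothetical, and the slide does not confirm that this is the selection rule Bixo uses.

```python
def update_url_db(existing, fetched):
    """Merge two record lists, keeping the newest record for each URL.

    Records are dicts with hypothetical keys 'url' and 'last_updated'
    (any comparable timestamp). On a tie the earlier-seen record wins.
    """
    merged = {}
    for record in list(existing) + list(fetched):
        current = merged.get(record["url"])
        if current is None or record["last_updated"] > current["last_updated"]:
            merged[record["url"]] = record
    # Sort by URL so the output order is deterministic.
    return sorted(merged.values(), key=lambda r: r["url"])
```

URLs are assumed to be normalized already, so that grouping by the `url` key really does bring all records for one page together.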

Output

[Diagram: MultiSinkTap writing to several sinks, each fed by an Each operation: URL Status, URL Content, and a Lucene Index (via IndexScheme)]

Robust testing

Unit tests

Jetty with special request handlers

wrong content type

slow responses

wrong headers

WebGraph test platform

test/simulate URL discovery

Looping/URL DB updates

page rank calcs, etc.

Wikipedia

large amount of data that can be "crawled" via local setup

http://webgraph.dsi.unimi.it/
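The idea of a local server with deliberately misbehaving request handlers can be sketched with Python's standard `http.server` in place of Jetty. The paths `/slow` and `/wrong-type`, the delay value, and the names `MisbehavingHandler` and `start_test_server` are all assumptions made for this illustration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class MisbehavingHandler(BaseHTTPRequestHandler):
    """Serve the same HTML page, but misbehave on selected paths."""

    slow_delay = 0.5  # seconds to stall before answering /slow

    def do_GET(self):
        if self.path == "/slow":
            time.sleep(self.slow_delay)
        # /wrong-type labels an HTML body as a PNG image.
        content_type = "image/png" if self.path == "/wrong-type" else "text/html"
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_test_server():
    """Start the server on a free local port in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), MisbehavingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

A fetcher under test would be pointed at `http://127.0.0.1:<port>/slow` with a short timeout, or at `/wrong-type` to check that it does not trust the declared content type; `server.server_address[1]` gives the port, and `server.shutdown()` stops the server.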

Resources

Web: http://bixo.101tec.com/

List: http://groups.yahoo.com/group/bixo-dev

Sources: https://github.com/emi/bixo/tree

Bug tracking: http://oss.101tec.com/jira/browse/bixo
