harnessing the power of nutch with scala
Crawling the web – Nutch with Scala
Vikas Hazrati
about
CTO at Knoldus Software
Co-founder at MyCellWasStolen.com
Community Editor at InfoQ.com
Dabbling with Scala – last 40 months
Enterprise-grade implementations on Scala – 18 months
nutch
Web search software
lucene
solr
crawler – link-graph – parsing
nutch – but we have google!
transparent
understanding
extensible
nutch – basic architecture
crawler – searcher
nutch – architecture
web database (crawl db) – links, pages
fetchlists
segments
crawler – recursive
nutch – crawl cycle
generate – fetch – update cycle
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
repeat until depth reached
Update segments
Index fetched pages
deduplication
Merge indexes for searching
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
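The one-shot crawl command above wraps the individual steps listed in this cycle. As a hedged sketch, the equivalent step-by-step Nutch 1.x commands look like this (directory paths are illustrative, and a Nutch installation is assumed, so this is shown as a command fragment only):

```shell
# seed the crawldb with the root URLs from the urls/ directory
bin/nutch inject crawl/crawldb urls

# one generate–fetch–update round (repeat until the desired depth)
bin/nutch generate crawl/crawldb crawl/segments -topN 5
segment=$(ls -d crawl/segments/2* | tail -1)   # the segment just generated
bin/nutch fetch "$segment"
bin/nutch updatedb crawl/crawldb "$segment"

# then invert links and index the fetched pages
# (Nutch 1.0-era Lucene indexing; newer releases index into Solr instead)
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes
```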
nutch – plugins
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
parser
HTMLParserFilter
URL Filter
scoring filter
generate – fetch – update cycle
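A plugin hooks into the cycle at one of these points. As a hedged sketch of the URL filter contract (the real interface is `org.apache.nutch.net.URLFilter`; a stand-in trait of the same shape is declared here so the example compiles on its own, and the supplier check is an assumption):

```scala
// Stand-in for org.apache.nutch.net.URLFilter so the sketch is self-contained;
// the real interface has the same shape: return the URL to keep it on the
// fetchlist, or null to drop it.
trait URLFilter {
  def filter(urlString: String): String
}

// Hypothetical filter keeping only supplier pages (the rule is an assumption).
class SupplierUrlFilter extends URLFilter {
  def filter(urlString: String): String =
    if (urlString.contains("supplier")) urlString else null
}

val f = new SupplierUrlFilter
```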
nutch – extension points
plugin.xml // tells Nutch about the plugin
build.xml  // build the plugin
ivy.xml    // plugin dependencies
src        // plugin source
nutch – example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings"
             name="Nutch Headings Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter"
                    class="com.knoldus.aggregator.server.plugins.DetailParserFilter" />
  </extension>
</plugin>
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  LOG.debug("Parsing URL: " + content.getUrl());
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();
  for (String tag : tags) {   // tags: headings collected elsewhere in the filter
    metadata.add(TAG_KEY, tag);
  }
  return parseResult;
}
scala – I have Java!
concurrency verbose
popular
OO library
Strongly typed
jvm
scala
Java:

    class Person {
      private String firstName;
      private String lastName;
      private int age;

      public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
      }

      public void setFirstName(String firstName) { this.firstName = firstName; }
      public String getFirstName() { return this.firstName; }
      public void setLastName(String lastName) { this.lastName = lastName; }
      public String getLastName() { return this.lastName; }
      public void setAge(int age) { this.age = age; }
      public int getAge() { return this.age; }
    }

Scala:

    class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
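A quick usage check of the one-line Scala class: the `var` parameters give generated getters and setters, so the class behaves like the full Java version (the sample values below are made up):

```scala
// the one-line class from the slide
class Person(var firstName: String, var lastName: String, var age: Int)

val p = new Person("Ada", "Lovelace", 36)  // sample values, not from the deck
p.age = 37                                  // compiler-generated setter
```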
scala
Java – everything is an object unless it is primitive
Scala – everything is an object. period.
Java – has operators (+, -, < ..) and methods
Scala – operators are methods
Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inferencing – val thing = new Thing
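The three contrasts above fit in a few lines of code (variable names here are illustrative):

```scala
val count = 42          // type inferred as Int – no "int count = 42" needed
val name  = "nutch"     // inferred as String
val sum   = 1 + 2       // + is just a method on Int…
val same  = (1).+(2)    // …so this is the identical call, written explicitly
val abs   = (-3).abs    // even "primitives" are objects with methods
```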
evolution
scala and concurrency
Fine-grained vs coarse-grained
Actors
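The actor model boils down to asynchronous message sends into a mailbox, processed by a `receive` loop. A minimal hand-rolled sketch of that idea (the talk-era code used the `scala.actors` library, since deprecated in favour of Akka; the `Mailbox` class here is my stand-in, not a Nutch or Scala API):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hand-rolled mailbox illustrating the actor idea.
class Mailbox[A] {
  private val queue = new LinkedBlockingQueue[A]()
  def !(msg: A): Unit = queue.put(msg)   // asynchronous send
  def receive(): A = queue.take()        // blocking receive
}

val inbox   = new Mailbox[String]
val results = new LinkedBlockingQueue[String]()
val worker  = new Thread(() => results.put("processed: " + inbox.receive()))
worker.start()
inbox ! "rawHtml"   // fire-and-forget, as with actor ! message
worker.join()
```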
actors
problem context
Aggregator
UGC (user-generated content)
solution
Aggregator
Supplier 1
Supplier 2
Supplier 3
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
Supplier URLs
plugins written in Scala
logic
Crawl the supplier
Is URL interesting? Parse
Pass extraction to actor
seed database
plugin – scala

    class DetailParserFilter extends HtmlParseFilter {

      def filter(content: Content, parseResult: ParseResult,
                 metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
        if (isDetailURL(content.getUrl)) {
          val rawHtml = content.getContent
          if (rawHtml.length > 0) processContent(rawHtml)
        }
        parseResult
      }

      private def isDetailURL(url: String): Boolean =
        url.matches(AggregatorConfiguration.regexEventDetailPages)

      private def processContent(rawHtml: Array[Byte]) =
        (new DetailProcessor).start ! rawHtml
    }
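isDetailURL boils down to `String.matches` against a configured regex. The actual `AggregatorConfiguration.regexEventDetailPages` value is not shown in the deck, so the pattern below is an assumption, purely to illustrate the check:

```scala
// hypothetical detail-page pattern; the real one lives in AggregatorConfiguration
val regexEventDetailPages = """.*/event/\d+.*"""

val detail  = "http://supplier.example.com/event/123".matches(regexEventDetailPages)
val listing = "http://supplier.example.com/index.html".matches(regexEventDetailPages)
```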
result
5 suppliers crawled
Crawl cycles run continuously for a few days
> 500K seed data collected
All with Nutch and 823 lines of Scala code
demo
in action ….
resources
http://blog.knoldus.com
http://wiki.apache.org/nutch/NutchTutorial
http://www.scala-lang.org/