harnessing the power of nutch with scala
Crawling the web – Nutch with Scala
Vikas Hazrati
about
CTO at Knoldus Software
Co-founder at MyCellWasStolen.com
Community Editor at InfoQ.com
Dabbling with Scala – last 40 months
Enterprise-grade implementations on Scala – 18 months
nutch
Web search software
lucene
solr
crawler – link-graph – parsing
nutch – but we have google!
transparent
understanding
extensible
nutch – basic architecture
crawler – searcher
nutch – architecture
web database (crawl db) – links, pages
fetchlists
segments
crawler – recursive
nutch – crawl cycle
generate – fetch – update cycle
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
repeat until depth reached
Update segments
Index fetched pages
deduplication
Merge indexes for searching
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
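The one-shot crawl command above wraps the individual steps listed in this cycle. As a hedged sketch, the equivalent step-by-step Nutch 1.x commands look like this (directory paths are illustrative, and a Nutch installation is assumed, so this is shown as a command fragment only):

```shell
# seed the crawldb with the root URLs from the urls/ directory
bin/nutch inject crawl/crawldb urls

# one generate–fetch–update round (repeat until the desired depth)
bin/nutch generate crawl/crawldb crawl/segments -topN 5
segment=$(ls -d crawl/segments/2* | tail -1)   # the segment just generated
bin/nutch fetch "$segment"
bin/nutch updatedb crawl/crawldb "$segment"

# then invert links and index the fetched pages
# (Nutch 1.0-era Lucene indexing; newer releases index into Solr instead)
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes
```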
nutch – plugins
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
parser
HTMLParserFilter
URL Filter
scoring filter
generate – fetch – update cycle
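A plugin hooks into the cycle at one of these points. As a hedged sketch of the URL filter contract (the real interface is `org.apache.nutch.net.URLFilter`; a stand-in trait of the same shape is declared here so the example compiles on its own, and the supplier check is an assumption):

```scala
// Stand-in for org.apache.nutch.net.URLFilter so the sketch is self-contained;
// the real interface has the same shape: return the URL to keep it on the
// fetchlist, or null to drop it.
trait URLFilter {
  def filter(urlString: String): String
}

// Hypothetical filter keeping only supplier pages (the rule is an assumption).
class SupplierUrlFilter extends URLFilter {
  def filter(urlString: String): String =
    if (urlString.contains("supplier")) urlString else null
}

val f = new SupplierUrlFilter
```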
nutch – extension points
plugin.xml // tells Nutch about the plugin
build.xml  // build the plugin
ivy.xml    // plugin dependencies
src        // plugin source
nutch – example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings"
             name="Nutch Headings Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter"
                    class="com.knoldus.aggregator.server.plugins.DetailParserFilter" />
  </extension>
</plugin>
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  LOG.debug("Parsing URL: " + content.getUrl());
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();
  for (String tag : tags) {   // tags: headings collected elsewhere in the filter
    metadata.add(TAG_KEY, tag);
  }
  return parseResult;
}
scala – I have Java!
concurrency verbose
popular
OO library
Strongly typed
jvm
scala
Java:

    class Person {
      private String firstName;
      private String lastName;
      private int age;

      public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
      }

      public void setFirstName(String firstName) { this.firstName = firstName; }
      public String getFirstName() { return this.firstName; }
      public void setLastName(String lastName) { this.lastName = lastName; }
      public String getLastName() { return this.lastName; }
      public void setAge(int age) { this.age = age; }
      public int getAge() { return this.age; }
    }

Scala:

    class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
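A quick usage check of the one-line Scala class: the `var` parameters give generated getters and setters, so the class behaves like the full Java version (the sample values below are made up):

```scala
// the one-line class from the slide
class Person(var firstName: String, var lastName: String, var age: Int)

val p = new Person("Ada", "Lovelace", 36)  // sample values, not from the deck
p.age = 37                                  // compiler-generated setter
```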
scala
Java – everything is an object unless it is primitive
Scala – everything is an object. period.
Java – has operators (+, -, < ..) and methods
Scala – operators are methods
Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inferencing – val thing = new Thing
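The three contrasts above fit in a few lines of code (variable names here are illustrative):

```scala
val count = 42          // type inferred as Int – no "int count = 42" needed
val name  = "nutch"     // inferred as String
val sum   = 1 + 2       // + is just a method on Int…
val same  = (1).+(2)    // …so this is the identical call, written explicitly
val abs   = (-3).abs    // even "primitives" are objects with methods
```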
evolution
scala and concurrency
Fine-grained vs coarse-grained
Actors
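The actor model boils down to asynchronous message sends into a mailbox, processed by a `receive` loop. A minimal hand-rolled sketch of that idea (the talk-era code used the `scala.actors` library, since deprecated in favour of Akka; the `Mailbox` class here is my stand-in, not a Nutch or Scala API):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hand-rolled mailbox illustrating the actor idea.
class Mailbox[A] {
  private val queue = new LinkedBlockingQueue[A]()
  def !(msg: A): Unit = queue.put(msg)   // asynchronous send
  def receive(): A = queue.take()        // blocking receive
}

val inbox   = new Mailbox[String]
val results = new LinkedBlockingQueue[String]()
val worker  = new Thread(() => results.put("processed: " + inbox.receive()))
worker.start()
inbox ! "rawHtml"   // fire-and-forget, as with actor ! message
worker.join()
```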
actors
problem context
Aggregator
UGC (user-generated content)
solution
Aggregator
Supplier 1
Supplier 2
Supplier 3
Create crawldb
Inject root URLs In crawldb
Generate fetchlist
Fetch content
Update crawldb
Supplier URLs
plugins written in Scala
logic
Crawl the supplier
Is URL interesting? Parse
Pass extraction to actor
seed database
plugin – scala

    class DetailParserFilter extends HtmlParseFilter {

      def filter(content: Content, parseResult: ParseResult,
                 metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
        if (isDetailURL(content.getUrl)) {
          val rawHtml = content.getContent
          if (rawHtml.length > 0) processContent(rawHtml)
        }
        parseResult
      }

      private def isDetailURL(url: String): Boolean =
        url.matches(AggregatorConfiguration.regexEventDetailPages)

      private def processContent(rawHtml: Array[Byte]) =
        (new DetailProcessor).start ! rawHtml
    }
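isDetailURL boils down to `String.matches` against a configured regex. The actual `AggregatorConfiguration.regexEventDetailPages` value is not shown in the deck, so the pattern below is an assumption, purely to illustrate the check:

```scala
// hypothetical detail-page pattern; the real one lives in AggregatorConfiguration
val regexEventDetailPages = """.*/event/\d+.*"""

val detail  = "http://supplier.example.com/event/123".matches(regexEventDetailPages)
val listing = "http://supplier.example.com/index.html".matches(regexEventDetailPages)
```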
result
5 suppliers crawled
Crawl cycles run continuously for a few days
> 500K seed data collected
All with Nutch and 823 lines of Scala code
demo
in action ….
resources
http://blog.knoldus.com
http://wiki.apache.org/nutch/NutchTutorial
http://www.scala-lang.org/