Nutch 0.9 Recrawling and Merging

Nutch is a web crawler/indexer built in Java that crawls and indexes documents for full-text search, running on top of Hadoop to parallelize the processing. Recrawling and Merging: this document looks at setup issues around adding documents to a Nutch index after an initial crawl has been done.


Recrawling and merging July 13, 2007 — nutch

As I mentioned in my introductory blog entry, I have already set up a working nutch installation and crawled/indexed some documents.

Now I have a different question: how can I evolve a corpus over time? Basically I want to start with a group of seed URLs and do a nutch crawl. There are two methodologies I know of so far, and I’m not sure whether I want to do an “intranet crawl” or a “whole web crawl”. The first uses the “nutch crawl” command:

Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]

The “whole web crawl” breaks that down into its constituent steps; here’s one I did:

nutch inject crawl/crawldb seed
nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s1
nutch updatedb crawl/crawldb $s1
nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s2
nutch updatedb crawl/crawldb $s2
nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s3
nutch updatedb crawl/crawldb $s3
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

The essence of the above is:

1. inject
2. loop on these:
   1. generate
   2. fetch2
   3. updatedb
3. invertlinks
4. index
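
Here is a minimal sketch of that cycle as a plain shell script, assuming a fixed three iterations and the same commands (fetch2, -topN 1000, timestamped segment names) as the run above:

#!/bin/sh
# A sketch of the whole-web crawl cycle above: inject once, then
# generate/fetch2/updatedb per depth level, then invertlinks and index.
# Assumes the same seed dir, -topN value, and fetch2 command as my run above.
nutch inject crawl/crawldb seed
for i in 1 2 3; do                              # one pass per depth level
  nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`   # newest segment
  nutch fetch2 $segment
  nutch updatedb crawl/crawldb $segment
done
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*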

So far I’ve built a corpus of several thousand documents. How should I add to it?

To be clear, I am conflating two issues a bit. Recrawling and merging are two separate operations. Recrawling goes through the existing pages and updates them; the wiki has a recrawl script (which unfortunately hasn’t been updated for version 0.9, so whether it’s still good for 0.9 isn’t clear). Merging, on the other hand, combines two (usually? mostly?) disjoint sets of documents and their attendant indexes; the wiki details a MergeCrawl script, again only through version 0.8. Neither of these scripts is in the distribution; why is that?

Neither “recrawl” nor “merge” is mentioned in the nutch tutorial.

I did a search for “merge” on nutch-user; I also did a search for “recrawl”. Then I followed a few of the threads:

“Nutch Crawl Vs. Merge Time Complexity” (Mar 2006) asks:

I’m using Nutch v0.7 and I’ve been running nutch on our company unix system, set up to crawl our intranet sites for updates daily. I’ve tried using merge, dedup, updatedb, etc. I noticed the time complexity and efficiency were worse than doing a fresh new crawl. For example, if I have two separate crawls from two different domains such as hotmail and yahoo, what would the time complexity be for nutch to crawl these two domains and then do a merge, compared to just doing a single full crawl of both domains? My guess would be that it will take nutch the same amount of time to do either one; if that is so, is there a reason to use the merge at all?

“Incremental indexing” (Jun 2007) asks:

As the size of my data keeps growing, and the indexing time grows even faster, I’m trying to switch from a “reindex all at every crawl” model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge indexes and perhaps linkdb. I have a few questions:

1. In an incremental scenario, how do I remove from the indexes references to segments that have expired?

2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it would appear that I can call “bin/nutch merge” with only two parameters: the original index directory as destination, and the directory to be merged into the former:

$nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output . Shouldn’t I instead create a new empty destination directory, do the merge, and then replace the original with the newly merged directory:

merged_indexes=$crawl_dir/merged_indexes
rm -rf $merged_indexes # just in case it's already there
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
rm -rf $index_dir.old # just in case it's already there
mv $index_dir $index_dir.old
mv $merged_indexes $index_dir
rm -rf $index_dir.old

3. Regarding linkdb, does running “$nutch_dir/nutch invertlinks” on the latest segment only, and then merging the newly obtained linkdb with the current one with “$nutch_dir/nutch mergelinkdb”, make sense rather than recreating linkdb afresh from the whole set of segments every time? In other words, can invertlinks work incrementally, or does it need to have a view of all segments in order to work correctly?
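
For what it’s worth, a sketch of the incremental linkdb idea in question 3 might look like this (the linkdb_new and linkdb_merged directory names are assumptions of mine, not from the thread):

# Sketch: invert links for the newest segment only, then fold the result
# into the existing linkdb. The linkdb_new / linkdb_merged names are assumptions.
latest_segment=`ls -d crawl/segments/2* | tail -1`
$nutch_dir/nutch invertlinks crawl/linkdb_new $latest_segment
$nutch_dir/nutch mergelinkdb crawl/linkdb_merged crawl/linkdb crawl/linkdb_new
rm -rf crawl/linkdb crawl/linkdb_new
mv crawl/linkdb_merged crawl/linkdb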

“Recrawl URLS” (Aug 2006) has a discussion between two people:

Q: I was searching for the method to add new urls to the crawling url list and how to recrawl all urls…

A: You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

[That's the same as the recrawl script from the wiki. --Kai]

Take a look at the adddays argument and at the configuration property db.default.fetch.interval. They influence the result.

Q: I have another question. I did what you told me, but it injects the new urls and “recrawls” them; unlike the first crawl, it doesn’t download the web pages and really crawl them… perhaps I’m mistaken somewhere. Any idea?

A: In the nutch conf/nutch-default.xml configuration file there is a property called db.default.fetch.interval. When you crawl a site, nutch schedules the next fetch for “today + db.default.fetch.interval” days. If you execute the recrawl command and the pages you fetched haven’t reached that date, they won’t be re-fetched. When you add new urls to the webdb, they will be ready to be fetched, so at that moment only those pages will be fetched by the recrawl script.

Q: But the websites just added haven’t yet been crawled, and they’re not crawled during the recrawl… Would “bin/nutch purge” restart everything?

A: The command “bin/nutch purge” doesn’t exist. I can’t tell you what is happening; give me the output when you run the recrawl.

I found that a bit inconclusive. Points of interest:

$ nutch inject
Usage: Injector <crawldb> <url_dir>

The above is the usage printed for “nutch inject” on the command line. And now from nutch-default.xml:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page.
  </description>
</property>

Ok, great, that’s deprecated. I really need some current documentation!
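
If I’m reading nutch-default.xml right, the non-deprecated replacement appears to be db.fetch.interval.default, specified in seconds rather than days, and the usual place to override either property is conf/nutch-site.xml. A sketch of such an override (the 7-day value is only an example):

<!-- conf/nutch-site.xml overrides nutch-default.xml; 604800 seconds = 7 days,
     and the value here is only an example. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
  <description>Number of seconds between re-fetches of a page.</description>
</property>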

“Recrawling… Methodology?” (Jul 2006) asks:

I need some help clarifying if recrawling is doing exactly what I think it is. Here’s the current scenario of how I think a recrawl should work:

I crawl my intranet with a depth of 2. Later, I recrawl using the script found below: http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 [the standard script –Kai]

In my recrawl, I also specify a depth of 2. It revisits each of the pages from before and, if they have changed, updates their content. If they have changed and new links exist, the links are followed to a maximum depth of 2.

This is how I think a typical recrawl should work. However, when I recrawl using the script linked above, tons of new pages are indexed, whether they have changed or not. It seems that if I crawl the content with a depth of 2 and then come back and recrawl with a depth of 2, it really adds a couple of crawl depth levels, and the outcome is that I have done a crawl with a depth of 4 (instead of a crawl with a depth of 2 followed by a recrawl that just catches any new pages).

The current steps of the recrawl are as follows:

for (however many depth levels are specified):

$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment

• invertlinks
• index
• dedup
• merge

Basically what made me wonder is that it took me 2 minutes to do the crawl, while the recrawl (same depth specified) has taken over 3 hours and is still going. After I recrawl once, I believe it then speeds up.
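
A sketch of the post-loop steps he lists (invertlinks, index, dedup, merge), in the same style as the loop above; the $linkdb_dir, $new_indexes, and $index_dir variable names are my assumptions, not taken from the wiki script:

# Post-loop steps from the bullet list above; $linkdb_dir, $new_indexes and
# $index_dir are assumed variable names, not taken from the wiki script.
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*
$nutch_dir/nutch dedup $new_indexes
$nutch_dir/nutch merge $index_dir $new_indexes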

I don’t know if that guy ever fixed his problem. He was doing the same thing as I except that he started initially with an “intranet crawl” and built on it (I deleted my initial “intranet crawl” and recrawled incrementally).

I’m not sure if repetition will help me, but here’s another description of how crawl works – “Re: How to recrawl urls” (Dec 2005):

The scheme of intranet crawling is like this: First, you create a webdb using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked with the current date and time. Then, you create a fetchlist using FetchListTool. The FetchListTool reads all URLs in the webdb which are due to be crawled and puts them in the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them in the webdb. Newly extracted outlinks get the current date and time, while all just-crawled URLs have their date and time set 30 days ahead (these things actually happen in FetchListTool). So all extracted links will be crawled the next time around, but not the just-crawled URLs. And so on and so forth.

Therefore, if the crawler is still alive after 30 days (or whatever threshold you set), all “just-crawled” urls will be taken out to recrawl. That’s why we need to maintain a live crawler at that point. This could be done using a cron job, I think.
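
For example, a crontab entry for that kind of scheduled recrawl might look something like this; the script path, its arguments, and the schedule are all assumptions, since the wiki script’s exact usage varies by version:

# Run a recrawl every night at 2am; the script path, its arguments
# (crawl dir, depth, adddays) and the log location are assumptions.
0 2 * * * /usr/local/nutch/bin/recrawl /usr/local/nutch/crawl 2 5 >> /var/log/nutch-recrawl.log 2>&1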

Slightly further into the above thread, Stefan Groschupf suggests: “do the steps manually as described here: SimpleMapReduceTutorial“; that tutorial, written by Earl Cahill in Oct 2005, has these steps (plus explanation):

cd nutch/branches/mapred
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
./bin/nutch crawl urls
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
./bin/nutch updatedb $CRAWLDB $SEGMENT
LINKDB=`find crawl-2* -name linkdb -maxdepth 1`
SEGMENTS=`find crawl-2* -name segments -maxdepth 1`
./bin/nutch invertlinks $LINKDB $SEGMENTS
mkdir myindex
ls -alR myindex

Here’s a somewhat basic discussion on merging: “Problem with merge-output” (Jun 2007)

Q: After recrawling several times, I have a problem with the merge-output directory. I have dug into the mail archive and found some clues: you should use a new dir name for the new merge, e.g. merge-output_new, then mv merge-output_new to merge-output.

A: This is something I usually do:

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments

You might want to replace the second statement with a ‘mv’ statement to back up the segments.
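
For instance, a variant that backs up the old segments instead of deleting them might look like this (the crawl/OLDsegments directory name is an assumption):

# Merge all segments into one, keeping the original segments as a backup
# (crawl/OLDsegments is an assumed name).
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
mkdir -p crawl/OLDsegments
mv crawl/segments/* crawl/OLDsegments
mv crawl/MERGEDsegments/* crawl/segments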

Here’s another: “Simple question about the merge tool” (Jul 2005):

Q: I have a simple question about how to use the merge tool. I’ve done three small crawls resulting in three small segment directories. How can I merge these into one directory with one index? I notice the merge command options:

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...

I don’t really understand what it’s doing with the outputIndex and the segments. Will this automatically delete segments after merging them into the output?

A: Use the bin/nutch mergesegs to merge many segments into one.

I’m curious about the usage of all these merge commands. Here’s a console session showing the usage each one reports when run from the command line:

$ nutch | grep merg
  mergedb           merge crawldb-s, with optional filtering
  mergesegs         merge several segments, with optional filtering and slicing
  mergelinkdb       merge linkdb-s, with optional filtering
  merge             merge several segment indexes
$ nutch mergedb
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter]
  output_crawldb  output CrawlDb
  crawldb1 ...    input CrawlDb-s (single input CrawlDb is ok)
  -normalize      use URLNormalizer on urls in the crawldb(s) (usually not needed)
  -filter         use URLFilters on urls in the crawldb(s)
$ nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]
  output_dir      name of the parent dir for output segment slice(s)
  -dir segments   parent dir containing several segments
  seg1 seg2 ...   list of segment dirs
  -filter         filter out URL-s prohibited by current URLFilters
  -slice NNNN     create many output segments, each containing NNNN URLs
$ nutch mergelinkdb
Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
  output_linkdb   output LinkDb
  linkdb1 ...     input LinkDb-s (single input LinkDb is ok)
  -normalize      use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed)
  -filter         use URLFilters on both fromUrls and toUrls in linkdb(s)
$ nutch merge
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...

Ah: the nutch javadoc has some comments on each of the above classes:


CrawlDbMerger – “nutch mergedb” – see also mergedb wiki

org.apache.nutch.crawl
Class CrawlDbMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.CrawlDbMerger

All Implemented Interfaces:

Configurable, Tool

public class CrawlDbMerger extends ToolBase

This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.

It’s possible to use this tool just for filtering – in that case only one CrawlDb should be specified in arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.

Author:

Andrzej Bialecki

SegmentMerger – “nutch mergesegs” – see also mergesegs wiki

org.apache.nutch.segment
Class SegmentMerger

java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.segment.SegmentMerger

All Implemented Interfaces:

Configurable, Closeable, JobConfigurable, Mapper, Reducer

public class SegmentMerger extends Configured

implements Mapper, Reducer

This tool takes several segments and merges their data together. Only the latest versions of data are retained.

Optionally, you can apply current URLFilters to remove prohibited URL-s.

Also, it’s possible to slice the resulting segment into chunks of fixed size.

Important Notes

Which parts are merged?

It doesn’t make sense to merge data from segments, which are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool will determine the lowest common set of input data, and only this data will be merged. This may have some unintended consequences: e.g. if majority of input segments are fetched and parsed, but one of them is unfetched, the tool will fall back to just merging fetchlists, and it will skip all other data from all segments.

Merging fetchlists

Merging segments which contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn’t ensure that fetchlist parts for each map task are disjoint.

Duplicate content

Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which in addition removes identical content found at different URL-s. In other words, running DeleteDuplicates is still necessary.

For some types of data (especially ParseText) it’s not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with “higher” names will prevail. It follows then that it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.

Merging and indexes

Merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segment(s) need to be indexed afresh. This tool doesn’t use existing indexes in any way, so if you plan to merge segments you don’t have to index them prior to merging.

Author:

Andrzej Bialecki

LinkDbMerger - “nutch mergelinkdb” – see also mergelinkdb wiki


org.apache.nutch.crawl

Class LinkDbMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.LinkDbMerger

All Implemented Interfaces:

Configurable, Closeable, JobConfigurable, Reducer, Tool

public class LinkDbMerger extends ToolBase

implements Reducer

This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.

It’s possible to use this tool just for filtering – in that case only one LinkDb should be specified in arguments.

If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but only at most db.max.inlinks inlinks will ever be added.

If activated, URLFilters will be applied to both the target URLs and to any incoming link URL. If a target URL is prohibited, all inlinks to that target will be removed, including the target URL itself. If some of the incoming links are prohibited, only those will be removed, and they won’t count when checking the above-mentioned maximum limit.

Author:

Andrzej Bialecki

IndexMerger – “nutch merge” – see also merge wiki

org.apache.nutch.indexer
Class IndexMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.indexer.IndexMerger

All Implemented Interfaces:

Configurable, Tool

public class IndexMerger extends ToolBase


IndexMerger creates an index for the output corresponding to a single fetcher run.

Author:

Doug Cutting, Mike Cafarella

I wrote a post asking for clarification about the above four merge commands: “four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge” (Jul 2007).

Q: Naively: why are there four merge commands? Are some subsets of the others? Are they used in conjunction? What are the usage scenarios of each?

A: Each is used in a different scenario:

mergedb: as its name does not imply, it is used to merge crawldb-s. So think of it as “mergecrawldb”.

mergesegs: merges segments. It merges <segment>/{content,crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text} information from different segments.

merge: Merges lucene indexes. After an index job, you end up with an indexes directory containing a bunch of part-<num> directories. The merge command takes such a directory and produces a single index. A single index has better performance (I think). You could say that merge is poorly named; it should have been called mergeindexes or something.

mergelinkdb: Should be obvious, merges linkdb-s.

So none of them is a subset of another. They all have different purposes. It is kind of confusing to have a “merge” command that only merges indexes, so perhaps we can add a mergeindexes command, keep merge for some time (noting that it has been deprecated) then remove it.

Q: It seems most of the nutch-user discussions I’ve seen so far relate to the simple merge command. Are the first three “advanced commands”?

A: They serve different purposes. Let’s assume that somehow you’ve got two crawldb-s, e.g. you ran two crawls with different seed lists and different filters. Now you want to take these collections of urls and create one big crawl. Then you would use mergedb to merge the crawldb-s, mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments.

And a simple “merge” merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle.
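
Putting that scenario together, a sketch of combining two finished crawls into one might look like this; all of the crawl1/crawl2/merged directory names are assumptions of mine:

# Combine two complete crawls (crawl1 and crawl2) into a new "merged" crawl;
# every directory name here is an assumption.
bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
bin/nutch mergelinkdb merged/linkdb crawl1/linkdb crawl2/linkdb
bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
# The merged segments get new names, so they must be re-indexed:
bin/nutch index merged/NEWindexes merged/crawldb merged/linkdb merged/segments/*
bin/nutch dedup merged/NEWindexes
bin/nutch merge merged/index merged/NEWindexes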

“Incremental indexing” (Jun 2007), quoted in full above, discusses the complex aspects of recrawling and merging rather clearly. It’s too bad nobody on nutch-user replied to it.


Here’s a very current and rather complex question, with replies, titled “incremental growing index” (Jul 2007):

Q: Our crawler generates and fetches segments continuously. We’d like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished. The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.

A: The reason for wanting the linkdb is the anchor information. If you don’t need any anchor information, you can provide an empty linkdb.

The reason why crawldb is needed is to get the current page status information (which may have changed in the meantime due to subsequent crawldb updates from newer segments). If you don’t need this information, you can modify Indexer.reduce() (~line 212) method to allow for this, and then remove the line in Indexer.index() that adds crawldb to the list of input paths.

Q: The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and segment). This includes all data from the linkdb and crawldb that we actually don’t need since we are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.

A: Is this really a problem for you now? Unless your segments are tiny, the indexing process will be dominated by I/O from the processing of parseText / parseData and by Lucene operations.

Q: Any ideas on how to tackle this issue? Is it feasible to lookup the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?

A: It would be probably too slow, unless you made a copy of linkdb/crawldb on the local FS-es of each node. But at this point the benefit of this change would be doubtful, because of all the I/O you would need to do to prepare each task’s environment …

Q: Thanks Andrzej. Perhaps these numbers make our issue more clear:

- after a week of (internet) crawling, the crawldb contains about 22M documents
- 6M documents are fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc)
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20 kB/doc max)

As you can see, for a segment of 500 MB more than 99% of the I/O during indexing is due to the linkdb and crawldb (roughly 4,399 MB + 75,955 MB ≈ 80 GB of crawldb and linkdb data versus 500 MB of segment data, or about 99.4% of the input). We could increase the size of our segments, but in the end this only delays the problem. We are now indexing without the linkdb, which reduces the time needed by a factor of 10. But we would really like to have the link texts back in again in the future.

Here’s a thread I started a couple weeks back: “Interrupting a nutch crawl — or use topN?” (Jun 2007):

I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two days then gracefully stop it (I don’t expect it to complete by then). Is there a way to do this? I want it to stop crawling then build the lucene index. Note that I used a simple nutch crawl command, rather than the “whole web” crawling methodology:

nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

Or is it better to use the -topN option?

Some documentation for topN:

“Re: How to terminate the crawl?”

“You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can befetched is topN*depth.”

Or from the tutorial:

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50


Typically one starts testing one’s configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
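
By the rule of thumb quoted earlier (at most topN × depth pages), the example call above (-depth 3 -topN 50) would fetch at most 3 × 50 = 150 pages.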

Here was one response to my question:

I use an iterative approach, using a script similar to what Sami blogs about here:

Online indexing – integrating Nutch with Solr

excerpt:

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server – for example when your collection size is limited to the number of documents that can be served by a single Solr instance, or when you would like to do your updates on a “live” index. Using Solr as your indexing server might even ease your maintenance burden quite a bit – you would get rid of manual index life-cycle management in Nutch and let Solr handle your index.

I then issue a crawl of 10,000 URLs at a time, and just repeat the process for as long as the window is available. Because I use solr to store the crawl results, the index is available during the crawl window. But I’m a relative newbie as well, so I look forward to what the experts say.

I looked at Sami Siren’s script; it’s pretty much the same as what I did at the top of this blog, except that his script “will execute one iteration of fetching and indexing.” The script’s only real difference is that it uses ’SolrIndexer’ (which you write) rather than the normal Indexer class, org.apache.nutch.indexer.Indexer (here’s the Indexer javadoc). I think I’m right in guessing that Indexer is what runs when you do “nutch index” from the command line. Just to beat a dead horse a bit more, here’s an excerpt from Sami’s script:

bin/nutch inject $BASEDIR/crawldb urls
checkStatus
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -f1|sort|tail -1`
echo processing segment $SEGMENT
bin/nutch fetch $SEGMENT -threads 20
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT -filter
checkStatus
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT
checkStatus
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
checkStatus

checkStatus is just a short function in the script that looks to see if any errors were generated by whatever command ran last. I also note that Sami is using a hadoop command that I don’t understand; the NutchHadoopTutorial mentions ‘hadoop dfs’ … but I think I may be drifting off topic.
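
For what it’s worth, that pipeline just lists the segment directories in the Hadoop filesystem and picks the most recent one; on a plain local filesystem the equivalent would be something like:

# Pick the newest segment directory on a local filesystem; this mirrors the
# intent of Sami's `bin/hadoop dfs -ls ... | sort | tail -1` pipeline.
SEGMENT=`ls -d $BASEDIR/segments/2* | tail -1`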


Here is the other response I got to my post:

In the past Andrzej put some stuff related to your issue in the Jira. Try to look it up there.

Found it http://issues.apache.org/jira/browse/NUTCH-368

NUTCH-368: Message queueing system (Sep 2006)

This is an implementation of a filesystem-based message queueing system. The motivation for this functionality is explained in HADOOP-490

HADOOP-490: Add ability to send “signals” to jobs and tasks

In some cases it would be useful to be able to “signal” a job and its tasks about some external condition, or to broadcast a specific message to all tasks in a job. Currently we can only send a single pseudo-signal, that is to kill a job.

This patch uses the message queueing framework to implement the following functionality in Fetcher:

* ability to gracefully stop fetching the current segment. This is different from simply killing the job in that the partial results (partially fetched segment) are available and can be further processed. This is especially useful for fetching large segments with long “tails”, i.e. pages which are fetched very slowly, either because of politeness settings or the target site’s bandwidth limitations.

* ability to dynamically adjust the number of fetcher threads. For a long-running fetch job it makes sense to decrease the number of fetcher threads during the day, and increase it during the night. This can be done now with a cron script, using the MsgQueueTool command line.

It’s worthwhile to note that the patch itself is trivial, and most of the work is done by the MQ framework.

After you apply this patch you can start a long-running fetcher job, check its <jobId>, and control the fetcher this way:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50

This adjusts the number of threads to 50 (starting more threads or stopping some threads as necessary).

Then run:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT

This will gracefully shut down all threads after they finish fetching their current url, and finish the job, keeping the partial segment data intact.


Susam Pal has posted (Aug 2007) a new script to crawl with nutch 0.9:

#!/bin/sh

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth $depth failed. Deleting it."
    rm -rf $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Can not reload index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

Susam comments as follows:

I have written this script to crawl with Nutch 0.9. I have tried to take care that it should work for re-crawls as well, but I have never done any real-world testing of re-crawls. I use this to crawl. You may try it out, and we can make some changes if it turns out not to be appropriate for re-crawls.


Introductory comments to this blog July 13, 2007 — nutch

From wikipedia:

Nutch is an effort to build an open source search engine based on Lucene Java for the search and index component.

I am writing this blog in order to publicly document my exploration of the nutch crawler and get feedback about what other folks have tried or discovered. I’ve already been using nutch for a few weeks so this blog doesn’t start completely at the beginning for me, but I’ll try to be explanatory in how I write here. Like many open source projects, nutch is poorly documented. This means that in order to find answers one has to make extensive use of google plus comb the nutch forums: nutch-user and nutch-dev. (Those links are hosted at www.mail-archive.com; they’re also hosted by www.nabble.com in a different format here: nutch-user and nutch-dev.) I’ve found that people are pretty responsive on nutch-user. The nutch to-do list, bugs, and enhancements are listed using JIRA software at issues.apache.org/jira/browse/Nutch.

Backdrop: I had latitude in making a choice of crawler/indexer, so in the beginning I read some general literature such as “Crawling the Web” by Gautam Pant, Padmini Srinivasan, and Filippo Menczer. On approaches to search, the entertaining “Psychosomatic addict insane” (2007) discusses latent semantic indexing and contextual network graphs. And let’s not forget spreading activation networks. Writing a crawler is not easy, so I looked at some java-based open source crawlers and started examining Heritrix. In a conversation with Gordon Mohr of the internet archive I decided to go with nutch, as he said Heritrix was more focused on storing precise renditions of web pages and on storing multiple versions of the same page as it changes over time. Nutch, on the other hand, just stores text, and it directly creates and accesses Lucene indexes, whereas the internet archive also has to use NutchWax to interact with Lucene.

The current version of nutch is 0.9; but rather than the main release I’m using one of the nightly builds that fixes a bug I ran into (see the NUTCH-505 JIRA). The nightly build also has a more advanced RSS feed handler. But I’m getting ahead of myself.

The best overall introductory article to nutch I’ve found so far is the following two-parter written by Tom White in January of 2006. It has a brief overall description of nutch’s architecture, then delves into the specifics of crawling a small example site; it tells how to set up nutch as well as tomcat, and what kind of sanity checks to do on the results you get back.

• Introduction to Nutch, Part 1: Crawling
• Introduction to Nutch, Part 2: Searching

On the architecture:

Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users’ search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments [a collection of pages fetched and indexed by the crawler in a single run] below in order to produce page summaries and to provide access to cached pages.)

The nutch site itself has a few items of note:


• The Version 0.8 Tutorial – Like the Tom White article, this has a lot of nuts-and-bolts advice. I believe it is current for version 0.9, though I can’t guarantee that.

• FAQ – As of this writing the FAQ has about 40 questions, divided into sections. Some of the sections I found worthwhile were:

  • Injecting
  • Fetching
  • Indexing
  • Segment Handling
  • Searching

• API doc – (sparse)

There is an article, written by nutch author Doug Cutting along with Rohit Khare, Kragen Sitaker, and Adam Rifkin, that has a clean description of nutch’s architecture; it is entitled “Nutch: A Flexible and Scalable Open-Source Web Search Engine”.

Excerpt:

4.1 Crawling: An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetchlist from the webdb, fetching those pages, parsing those for links, then updating the webdb. In the terminology of [4], Nutch’s crawler supports both a crawl-and-stop and a crawl-and-stop-with-threshold policy (which requires feedback from scoring and specifying a floor). It also uses a uniform refresh policy: all pages are refetched at the same interval (30 days, by default) regardless of how frequently they change. (There is no feedback loop yet, though the design of Page.java can set individual recrawl deadlines on every page.) The fetching process must also respect bandwidth and other limitations of the target website. However, any polite solution requires coordination before fetching; Nutch uses the most straightforward localization of references possible: namely, making all fetches from a particular host run on one machine.

Another slide show (PDF) by Doug Cutting, ”Nutch, Open-Source Web Search“ shows the architecture:

Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which must then be summarized for display.


Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene to store its indexes.

Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.

Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.

There is a lengthy video presentation (71 minutes) with Doug Cutting, sponsored by IIIS in Helsinki, 2006. It has an associated PDF slide show entitled “Open Source Platforms for Search“. The introduction has a philosophical discourse on open source software then gets down to a meaty technical discussion after about eight minutes. For instance, Doug discusses that with a single person as administrator, nutch scales well up to about 100 million documents. Beyond that, billions of pages are “operationally onerous”.

One of the more widely linked articles by Doug Cutting and Mike Cafarella is “Building Nutch: Open Source Search” (printer-friendly version). On page 3 they outline nutch’s operational costs; note that these dollar estimates were done in early 2004:

A typical back-end machine is a single-processor box with 1 gigabyte of RAM, a RAID controller, and eight hard drives. The filesystem is mirrored (RAID level 1) and provides 1 terabyte of reliable storage. Such a machine can be assembled for a cost of about $3,000…. A typical front-end machine is a single-processor box with 4 gigabytes of RAM and a single hard drive. Such a machine can be assembled for about $1,000…. Note that as traffic increases, front-end hardware quickly becomes the dominant hardware cost.

A 2007 paper from IBM Research entitled “Scalability of the Nutch Search Engine” explores some blade server configurations and uses mathematical models to conclude that nutch can scale well past the base cases they actually run. Note that the paper is about the index/search aspect of nutch rather than the crawling.

Search workloads behave well in a scale-out environment. The highly parallel nature of this workload, combined with a fairly predictable behavior in terms of processor, network and storage scalability, makes search a perfect candidate for scale-out. Scalability to thousands of nodes is well within reach, based on our evaluation that combines measurement data and modeling.

Lucene is the searching/indexing component of nutch; one of the things that attracted me to nutch was that I would be able to have an end-to-end, customizable package to implement search. And either lucene or nutch can be used for the query processing; nutch just has a simpler query syntax: it is optimized for the most common web queries so it doesn’t support OR queries, for instance. There are other crawlers, such as Heritrix which is very robust and is used by the internet archive, and other indexers like Xapian, which is very performant. ‘Archiving “Katrina” Lessons Learned‘ was a project that chose to use Heritrix and NutchWax. For now I’m happy with nutch+lucene. The one book I found that has much to say about Lucene (and even it has only minimal coverage of nutch) is Lucene in Action by Erik Hatcher and Otis Gospodnetic. I should also mention that the book has thorough coverage of Luke, a tool that is useful for playing with lucene indexes. The apache lucene mailing lists in searchable form are java-user and java-dev. The lucene FAQ is frequently updated.