Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy∗

Jiang-Ming Yang†, Rui Cai†, Chunsong Wang‡, Hua Huang§, Lei Zhang†, Wei-Ying Ma†

†Microsoft Research, Asia. {jmyang, ruicai, leizhang, wyma}@microsoft.com
‡University of Wisconsin-Madison. [email protected]

§Beijing University of Posts and Telecommunications. [email protected]

ABSTRACT
We study in this paper the problem of incremental crawling of web forums, which is a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting of each individual page. However, simply assigning different weights to individual pages is usually inefficient for crawling forum sites, because forum sites differ in character from general websites. Instead of treating each individual page independently, we propose a list-wise strategy that takes site-level knowledge into account. Such site-level knowledge is mined by reconstructing the linking structure, called the sitemap, of a given forum site. With the sitemap, posts from the same thread but distributed over various pages can be concatenated according to their timestamps. After that, for each thread, we employ a regression model to predict when the next post will arrive. Based on this model, we develop an efficient crawler which is 260% faster than some state-of-the-art methods in terms of fetching newly generated content, while also ensuring a high coverage ratio. Experimental results show promising performance in Coverage, Bandwidth Utilization, and Timeliness on 18 forums of various kinds.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - clustering, information filtering

General Terms
Algorithms, Performance, Experimentation

Keywords
Web Forum, Sitemap, Incremental Crawling

1. INTRODUCTION
Due to the explosive growth of Web 2.0, web forums (also called bulletin or discussion boards) are becoming an important data resource on the Web1.

∗This work was performed when the 3rd and the 4th authors were interns in Microsoft Research Asia.
1http://en.wikipedia.org/wiki/Internet forum

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'09, June 28–July 1, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-495-9/09/06 ...$5.00.

Many research efforts and commercial systems, such as Google, Yahoo!, and Live, have begun to leverage information extracted from forum data to build various web applications. Examples include understanding user opinions from digital product reviews2 for both product recommendation and business intelligence, integrating travel notes for travel recommendations3, and extracting question-answer pairs for QnA services [8].

Fetching forum data (mainly user-created posts) from various forum sites is the fundamental step of most related web applications. To satisfy the application requirements, an ideal forum crawler should make a tradeoff between updating existing discussion threads (i.e., completeness) and fetching new discussion threads in time (i.e., timeliness). Completeness refers to identifying updated discussion threads and downloading the newly published content. This is important to applications like online QnA services. Since pagination is widely used in web forums, the posts of a question and its best answers are likely to be distributed over different pages. In this case, a forum crawler should fetch all the pages of a discussion thread to ensure that no best answer is missed. Timeliness refers to efficiently discovering and downloading newly published discussion threads. This is also important because many web applications need fresh information to respond quickly to news events and new product releases. For example, users may be interested in candid feedback on the iPhone shortly after its release, which creates the need to rapidly gather consumer comments about it from various web forums.

However, most existing crawling strategies designed for general websites are not well suited to forum crawling and cannot satisfy the above requirements, because web forums have some unique characteristics. (I) Content in forum sites is organized in the form of lists. Because forum content, such as a long discussion thread, is usually too large to be displayed on one single page, most web forums adopt pagination to divide long content into multiple pages, as shown in Fig. 1. (II) In web forums, pages can be categorized into different types according to their functions. The two most common page types are index-of-thread and post-of-thread. As shown in Fig. 1(a), a list of index-of-thread pages (called an index list) shows the basic information (such as title, author, and creation time) of all the discussion threads in a discussion board or sub-board; and as shown in Fig. 1(b), a list of post-of-thread pages (called a post list) consists of the detailed content of all the posts in the corresponding thread. Unfortunately, these two characteristics are ignored by most existing crawlers.

2www.dpreview.com
3www.tripadvisor.com



Figure 1: An illustration of (a) index-of-thread pages in the sub-board "Getting Started", (b) post-of-thread pages in the thread "HOW TO: Using sorting / paging on GridView w/o a DataSourceControl DataSource", and (c) the data growth of an index list.

They simply treat each individual page in a forum as a single object and optimize the crawling schedule for each page independently. This leads to the following two problems in forum crawling:

• Incapable of properly predicting the re-crawling schedule. Because the list-wise structure is missed, a crawler does not know which content is actually new. For example, as shown in Fig. 1(c), suppose at time T1 the index list contains two pages while at T2 it has three pages, with some new content added at the beginning of page P1; a general crawler then considers all three pages updated. In this case, it will continually re-crawl pages P2 and P3 even though they contain no new information.

• Incapable of finding newly published discussion threads in time. In forums, timely updating of index lists is crucial for discovering new discussion threads. However, a general crawler does not know which pages belong to index lists and simply treats all pages equally. Therefore, it cannot optimize the bandwidth to meet the timeliness requirement of forum crawling.

Towards a crawler optimized for forums, in this paper we propose a list-wise incremental crawling strategy based on mining site-level knowledge from the target forums. First, unlike general crawlers which optimize the crawling schedule for every individual page, our strategy takes a list (either an index list or a post list) as the basic unit for schedule optimization. Second, to understand the list structures and the types of lists in a forum, we propose a set of offline mining algorithms which first discover the site-level knowledge of the target site and then incorporate that knowledge to optimize the crawling schedules. The contributions of this paper are briefly highlighted as follows:

• Site-Level Knowledge Mining. The site-level knowledge in this paper mainly refers to the sitemap of a forum site [4, 15]. A sitemap is a graph-based representation of the organization structure of a web forum. It tells us how many kinds of pages there are in the target forum and how these pages are linked with each other. With the guidance of the sitemap, we concatenate pages to reconstruct index lists and post lists.

• List-Wise Schedule Optimization. Based on the reconstructed lists, we first propose an algorithm to discover the timestamps of each thread, as well as of each post in a given thread. With the timestamps, some site-level statistics are utilized to estimate the update frequency and longevity of each list. It should be noted that such site-level statistics are robust, as they suffer little from the noise present in individual pages. Finally, a regression model is employed to combine various statistics to predict the update frequency of each list.

• Bandwidth Control of Different Types of Lists.

As we have argued, timely refreshing of index lists is crucial to ensure the timeliness of a forum crawler. However, post lists are dominant in most forums and index lists are just a small fraction. Thus, updating post lists consumes most of the bandwidth, and the refreshing of index lists is prone to be delayed. Since we can identify index lists and post lists, we also propose an efficient bandwidth control strategy to balance the discovery of new content and the updating of existing content.

In the experiments, the proposed solution was evaluated on a mirrored data set containing 990,476 pages and 5,407,854 individual posts, published from March 1999 to June 2008. The crawler with our list-wise strategy is 260% faster than some state-of-the-art methods in terms of fetching newly generated content, while still ensuring a higher coverage ratio.

This paper is organized as follows. Related research efforts are reviewed in Section 2. In Section 3, we first briefly describe the framework of our approach and then explain each component in detail. Experimental evaluations are reported in Section 4, and we draw conclusions in the last section.

2. RELATED WORK
To the best of our knowledge, little existing work has systematically investigated the problem of incremental forum crawling. However, there is some previous work that should be reviewed, as our approach was motivated by it.

Some early work [2, 3] first investigated the dynamic web and treated information as a depreciating commodity. It introduced the concepts of the lifetime and age of a page, which are important for measuring the performance of


an incremental crawler. However, that work treated every page equally and ignored page importance and change frequency, which also matter to an incremental crawler.

Later work improved on this and optimized the crawling process by prioritizing pages. Whether to minimize the age or to maximize the freshness leads to a variety of analytical models built on different features. We classify them into three categories:

(I) How often the content of a page is updated by its owner. Coffman et al. [7] analyzed the crawling problem theoretically. Cho et al. [6] proposed several methods based on the page update frequency. However, most of these methods rely on the assumption that web pages change as a Poisson, or memoryless, process. The experimental results in [3] showed instead that most web pages are modified during the span of US working hours (between 8 AM and 8 PM, Eastern time, Monday to Friday). This phenomenon is even more evident in forum websites because all the content is generated by forum users. Edwards et al. [10] used an adaptive approach that maintains data on actual change rates for the optimization, without the above assumption. However, it still treats each page independently and may waste bandwidth. Furthermore, we argue that some other factors, such as the time intervals between the latest posts, are more important in web forums. We will show their importance in the experimental part.

(II) The importance of each web page. Baeza-Yates et al. [1] determined the weight of each page based on strategies similar to PageRank. Wolf et al. [16] assigned the weight of each page based on the embarrassment metric of users' search results. In the user-centric crawling strategy [13], the targets are mined from user queries to guide the refreshing schedule of a generic search engine. First of all, some pages may have equal importance but different update frequencies, so measuring only the importance of each page is insufficient. Second, both static rank and content importance are of little use in web forums. Most forum pages are dynamic pages generated from pre-defined templates, and it is very hard to compute a meaningful PageRank-like static rank for them, since the links among these pages are mostly template-generated. Content importance is similarly unhelpful: once a post is generated, it always exists unless the author deletes it manually. Before we fetch a post, it is very hard to predict its importance; yet once we have its content, we usually do not need to revisit it anymore. Some work, known as focused crawling, attempts to retrieve only web pages relevant to pre-defined topics [5, 9, 11] or to labeled examples [14] by assigning higher weights to pages similar to the targets. The target descriptions in focused crawling differ widely across applications.

(III) The information longevity of each web page. Olston et al. [12] introduced information longevity to determine the revisiting frequency of each page. However, information longevity is of little use in forums, since a post never disappears unless it is deleted. This is one of the major differences between general web pages and forum pages. Moreover, the three generative models in [12] are still based on the Poisson distribution and some modified forms of it.

Realizing the importance of forum data and the challenges in forum crawling, Cai et al. [4, 15] studied how to reconstruct the sitemap of a target web forum and how to choose an optimal traversal path so as to crawl informative pages only.


Figure 2: The flowchart of the proposed solution for incremental forum crawling.

However, this work only addressed the problem of fetching as much valuable data as possible, and left the problem of refreshing previously downloaded pages [12] untouched.

All the existing methods ignore the tradeoff between discovering new threads and refreshing existing threads. Intuitively, index-of-thread pages should be refreshed more frequently than post-of-thread pages, because any update activity by users changes the index-of-thread pages. However, because post-of-thread pages greatly outnumber index-of-thread pages, the latter may be overwhelmed by a mass of post-of-thread pages even if only a small percentage of post-of-thread pages are very active. These post-of-thread pages will occupy most of the bandwidth, and as a result new discussion threads cannot be downloaded efficiently. Unfortunately, few existing methods have taken this issue into account. We will show its importance in the experimental part.

3. OUR SOLUTION
In this section, we first give an overview of the proposed solution and then introduce the details of each component.

3.1 System Overview
The proposed solution consists of two parts, offline mining and online crawling, as shown in Fig. 2.

The offline mining learns the sitemap of a given forum site from a few pre-sampled pages of the target site. By analyzing these sampled pages, we can find out how many kinds of pages there are in that site and how these pages are linked with each other [4]. This site-level knowledge is then employed in the online crawling process.


The online crawling consists of three steps: (a) identifying index lists and post lists; (b) predicting the update frequency of each list using a regression model; and (c) balancing bandwidth by adjusting the numbers of the various lists in the crawling queue, as shown in the left part of Fig. 2. Given a newly crawled page, we first identify whether it is an index-of-thread page or a post-of-thread page according to its layout information. Since an index list or post list consists of multiple pages, we reconstruct a list by concatenating the corresponding pages along the detected pagination links [15]. Then, for each reconstructed list, we extract the timestamps of the records (either posts or thread entries) contained in that list. Several statistics based on these timestamps are proposed to characterize the growth of a list, and a regression model is trained to predict the arrival time of the next record of that list. Finally, given a fixed bandwidth, we balance the numbers of index and post lists to satisfy the requirements of both completeness and timeliness.

3.2 List Reconstruction
The sitemap is an organization graph of a given forum site and is the fundamental component of the proposed crawling method. The details of its generation algorithm can be found in [4, 15]; here we briefly describe it. To estimate the sitemap, we first sample a few pages from the target site. Because the sampling quality is crucial to the whole mining process, to diversify the sampled pages in terms of page layout and to reach pages at deep levels, we adopt a combined breadth-first and depth-first strategy using a double-ended queue. In the implementation, we push as many unseen URLs as possible from each crawled page into the queue, and then randomly pop a URL from the front or the end of the queue for the next sample. In practice, we found that sampling a few thousand pages is sufficient to reconstruct the sitemap for most forum sites. After that, pages with similar layout structures are clustered into groups using the single-linkage algorithm, as marked with blue nodes in Fig. 3. In our approach, we use repetitive regions to characterize the layout of each page. Repetitive regions are very common in forum pages, as they present data records stored in a database. Considering that two similar pages usually differ in the number of advertisements, images, and even some complex sub-structures embedded in user posts, the repetitive-region-based representation is more robust than the whole DOM tree [17, 18]. Finally, a link between two page groups is established if some page in the source group has an out-link pointing to a page in the target group.

After that, a classifier is introduced to identify the sitemap nodes of index-of-thread pages and post-of-thread pages. The classifier is built on simple yet robust features: for example, the node of post-of-thread pages always contains the largest number of pages in the sitemap, and index-of-thread pages should be the parents of post-of-thread pages, as marked with the red and green rectangles in Fig. 3. The classifier is site-independent and can be used across different forum sites. We then map each individual page onto one of the nodes by evaluating its layout similarity, i.e., sim(pnew, Nindex) and sim(pnew, Npost), where δ is the decision threshold, and we concatenate all the individual pages belonging to the same index list or post list. The details of this process are described in Algorithm 1. The whole procedure is fully automatic.

[Figure 3 sketch: sitemap nodes such as Home Page, List-of-Board, Index-of-Thread, Post-of-Thread, Search Result, Login Portal, and Digest; each node groups pages with the same layout, and nodes are connected by links, including pagination links.]

Figure 3: An illustration of a sitemap.

Algorithm 1 The algorithm for constructing index lists and post lists. Suppose sitemap(N, E) is the sitemap knowledge with nodes N and edges E, Nindex is the node of index-of-thread pages, and Npost is the node of post-of-thread pages.

1: Suppose there is a new page pnew.
2: Parse the page pnew and get its pagination out-link list OutLinksnew.
3: if sim(pnew, Nindex) < δ then
4:   for all pages pi in Nindex, with OutLinksi the pagination out-link list of pi do
5:     If (1) ∃ linkj ∈ OutLinksi such that pnew = linkj, or (2) ∃ linkk ∈ OutLinksnew such that pi = linkk:
6:       Concatenate pnew into the index list of pi.
7:   end for
8: else if sim(pnew, Npost) < δ then
9:   for all pages pi in Npost, with OutLinksi the pagination out-link list of pi do
10:     If (1) ∃ linkj ∈ OutLinksi such that pnew = linkj, or (2) ∃ linkk ∈ OutLinksnew such that pi = linkk:
11:       Concatenate pnew into the post list of pi.
12:   end for
13: end if

With the concatenated index lists and post lists, our crawling method becomes a list-wise strategy rather than the page-level strategy used by most existing general crawlers.
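For readers who prefer code, the following Python sketch mirrors Algorithm 1; the page objects, the sim function, and the list containers are assumed interfaces rather than the authors' implementation.

```python
def attach_to_list(p_new, node_index, node_post, sim, delta, index_lists, post_lists):
    """Sketch of Algorithm 1: attach a newly crawled page to its index or post list.

    `node_index` / `node_post` are the sitemap nodes holding already-seen pages;
    each page object carries `url` and `pagination_outlinks`.  `index_lists` and
    `post_lists` map a list head page to the pages concatenated so far.
    """
    out_new = p_new.pagination_outlinks

    def connected(p_i):
        # Two pages belong to the same list if either one's pagination
        # out-links point to the other (conditions (1) and (2) in Algorithm 1).
        return p_new.url in p_i.pagination_outlinks or p_i.url in out_new

    if sim(p_new, node_index) < delta:
        for p_i in node_index.pages:
            if connected(p_i):
                index_lists.setdefault(p_i, []).append(p_new)
    elif sim(p_new, node_post) < delta:
        for p_i in node_post.pages:
            if connected(p_i):
                post_lists.setdefault(p_i, []).append(p_new)
```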

3.3 Timestamp Extraction
To predict the update frequency of a list more accurately, we first need to extract the time information of each record (the thread creation time in index lists or the post time in post lists). Extracting the correct time information is difficult because forum pages contain noisy time records, such as users' registration times and last login times. In Fig. 4, the time strings in orange rectangles are all noise; only the one in purple is the correct time information.

There are three steps to extract the time information. First, we collect timestamp candidates, i.e., short text fragments that contain a digit string such as mm-dd-yyyy or dd/mm/yyyy, or specific words such as Monday, Tuesday, January, February, etc.


Figure 4: Example timestamps in a page.

Algorithm 2 The algorithm for extracting the timestamp DOM path from a list L. Suppose tsSet is the set of all timestamp candidates for the pages in L and there are N unique DOM paths for these candidates in tsSet.

1: for all DOM paths pathi, 1 ≤ i ≤ N do
2:   Set SeqOrd = 0 and RevOrd = 0.
3:   for all pages pj in L do
4:     Get all candidates tsListj in pj whose DOM path is pathi and sort them in their order of appearance.
5:     for all time candidates tck in tsListj, 1 ≤ k < M, where M is the length of tsListj do
6:       if tck is earlier than tck+1 then
7:         SeqOrd++
8:       else if tck is later than tck+1 then
9:         RevOrd++
10:      end if
11:    end for
12:  end for
13:  Set Orderi = |SeqOrd − RevOrd| / (SeqOrd + RevOrd)
14: end for
15: Let c = argmaxi Orderi
16: Return pathc

Second, we align the HTML elements containing time information into groups based on their DOM paths, since the timestamp of each record should have the same DOM path in the HTML document. Finally, since the records are generated in sequence, their timestamps should follow a sequential order; this helps filter out noisy time records and extract the correct information. The details of this process are described in Algorithm 2.
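A rough sketch of the candidate detection and the order-based DOM-path selection of Algorithm 2 might look as follows; the regular expression, the page interface, and the parse_time helper are illustrative assumptions, not the extraction rules actually used in the paper.

```python
import re
from collections import defaultdict

# Illustrative pattern only: digit-style dates plus a few date words.
DATE_RE = re.compile(
    r"\b(\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{4}[-/]\d{1,2}[-/]\d{1,2}|"
    r"Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|"
    r"January|February|March|April|May|June|July|"
    r"August|September|October|November|December)\b", re.IGNORECASE)

def select_timestamp_path(pages, parse_time):
    """Sketch of Algorithm 2: pick the DOM path whose timestamps are best ordered.

    Each page is assumed to expose `text_nodes()` yielding (dom_path, text) pairs;
    `parse_time(text)` converts a candidate string into a comparable value.
    """
    order_score = defaultdict(lambda: [0, 0])          # path -> [SeqOrd, RevOrd]
    for page in pages:
        by_path = defaultdict(list)
        for dom_path, text in page.text_nodes():
            if len(text) < 40 and DATE_RE.search(text):
                by_path[dom_path].append(parse_time(text))
        for dom_path, stamps in by_path.items():
            for a, b in zip(stamps, stamps[1:]):       # appearance order
                if a < b:
                    order_score[dom_path][0] += 1
                elif a > b:
                    order_score[dom_path][1] += 1

    def consistency(path):
        seq, rev = order_score[path]
        return abs(seq - rev) / (seq + rev) if seq + rev else 0.0

    return max(order_score, key=consistency) if order_score else None
```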

3.4 Prediction Model
The latest replies in each thread can be found by revisiting post lists; similarly, new discussion threads can be discovered by revisiting index lists. The problem of incremental crawling of web forums thus becomes estimating the update frequency of each list (post list or index list) and predicting when its next new record arrives. Based on the timestamp of each record, we propose several features to help predict the update frequency; they are described in Table 1.

Furthermore, for index lists, we analyze the average thread activity in forum sites. We process all the threads and calculate their active time by checking the times of the first and last posts in each thread. The result is shown in Fig. 5, which gives the percentage of threads for different active times. The percentage of active threads drops significantly as the active time grows: more than 40% of threads stay active for no longer than 24 hours, and 70% for no longer than 3 days.


Figure 5: The thread activity analysis.

This is one of the major reasons why incremental crawling of forums differs from traditional incremental crawling. Once a thread is created, it usually becomes static after a few days, when there is no more discussion activity. We therefore introduce a state indicator to avoid wasting bandwidth. Suppose there has been no discussion activity for a period Δtna since the last post. We compute the standard deviation of the time intervals as Δtsd = √( 1/(N−1) · Σ_{i=1}^{N−2} (Δti − Δtavg)² ), where N is the number of post records. If Δtna − Δtavg > α · Δtsd, we set ds = 1; otherwise ds = 0. This factor is for index lists only.
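A small sketch of this state indicator is shown below, assuming post timestamps in seconds and a tunable α (whose value the paper does not report); the normalization of the standard deviation here is the usual sample estimate over all consecutive intervals, which may differ slightly from the formula above.

```python
import math

def is_static(post_times, now, alpha=2.0):
    """Sketch of the state indicator d_s for a thread (post_times sorted ascending).

    Returns 1 if the idle time since the last post exceeds the average posting
    interval by more than alpha standard deviations, i.e. the thread is likely
    static and need not be revisited frequently.
    """
    n = len(post_times)
    if n < 3:
        return 0                                    # not enough history to decide
    intervals = [b - a for a, b in zip(post_times, post_times[1:])]
    dt_avg = sum(intervals) / len(intervals)
    dt_sd = math.sqrt(sum((dt - dt_avg) ** 2 for dt in intervals) / (len(intervals) - 1))
    dt_na = now - post_times[-1]                    # idle time since the last post
    return 1 if dt_na - dt_avg > alpha * dt_sd else 0
```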

To combine these factors, we leverage a linear regression model, which is computationally lightweight and efficient for online processing.

F(x) = w0 + Σ_{i=1}^{N} wi · xi    (1)

For each forum site, we train two models, Flist(x) and Fpost(x), for index lists and post lists separately. The two models are kept up to date during the crawling process using the newly crawled pages; in practice, we retrain them every two months. By setting x0 = 1, we obtain the corresponding weight vector W from Equation 2.

W = (X^T · X)^{−1} · X^T · Y    (2)

In the crawling process, we predict the number of newly arrived records of each list as ΔI = (CT − LT)/F(x), where CT is the current time and LT is the last time the crawler visited the list. We use the predicted ΔI to schedule the crawling process. Although a list may contain multiple pages, we do not need to revisit all of them; we sort the pages in each list by the timestamps of the records on each page. For an index list, if there are Numlist records on each index-of-thread page, we only need to revisit the top ΔIlist/Numlist pages, where ΔIlist = (CT − LT)/Flist(x). For a post list, we need to revisit only the last page or any newly discovered pages.
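The least-squares fit of Equation 2 and the revisit rule above can be sketched with numpy as follows; the feature construction is omitted and the function names are assumptions made for this illustration.

```python
import numpy as np

def fit_weights(X, Y):
    """Least-squares weights W = (X^T X)^(-1) X^T Y for Eq. (1)/(2).

    X is an (m, N+1) matrix whose first column is all ones (x0 = 1); Y holds the
    observed time interval until the next record for each training sample.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ Y   # pseudo-inverse guards against a singular X^T X

def pages_to_revisit(x, W, current_time, last_visit, records_per_page):
    """Sketch of the scheduling rule for an index list.

    F(x) = W . x predicts the interval between consecutive new records, so
    (CT - LT) / F(x) estimates how many new records arrived since the last visit,
    and dividing by the records shown per page gives how many top pages to refetch.
    """
    predicted_interval = float(np.dot(W, x))
    if predicted_interval <= 0:
        return 1                                   # degenerate prediction: refetch the first page
    new_records = (current_time - last_visit) / predicted_interval
    return max(1, int(np.ceil(new_records / records_per_page)))
```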

3.5 Bandwidth Control
As discussed in the previous sections, we need to make a tradeoff between discovering new pages and refreshing existing pages. The number of post-of-thread pages is much larger than that of index-of-thread pages. Even if only a small percentage of post-of-thread pages are very active, index-of-thread pages may be overwhelmed by the mass of post-of-thread pages, which will occupy most of the bandwidth. We therefore allocate dedicated bandwidth to index-of-thread pages, since new discussion threads can only be discovered from them. We introduce a bandwidth control strategy to balance the bandwidth between the two kinds of lists. The ratio between index lists and post lists is defined as:

ratio ∝ Nindex / NPost    (3)


Table 1: The proposed features in the prediction model.

Δt1, Δt2, Δt3, Δt4, Δt5: Instead of leveraging the timestamps directly, we leverage the time intervals between each two consecutive timestamps, which represent the recent update frequency of a list.

Δtavg: The average time interval represents the update history of the list. Because the latest five time intervals may be affected by accidental events, the average time interval between consecutive records in a list smooths the result and tolerates such accidents.

Δtsite: At the beginning of a list, little information is available to predict its update frequency. We leverage the average time interval of the current forum site to approximate it.

len: The length of a list can represent its hotness, which may attract more users and thus also affect the update frequency of the list.

ct1, ct2, ct3, ..., ct24: The experimental results in [3] show that the update frequency of web pages is highly dependent on the current time. We represent the current hour of the day by a vector with 24 dimensions; for example, 3 PM is represented by setting ct15 = 1 and all others to zero.

Δthour: Since the update frequency of web pages is highly dependent on daytime versus nighttime, we split one day into 24 hours and leverage the average time interval of the whole forum site in the current hour.

cd1, cd2, cd3, ..., cd7: Similar to the current-time feature, we represent the current day of the week by a vector with 7 dimensions; for example, Wednesday is represented by setting cd3 = 1 and all others to zero.

Δtday: Since the update frequency of pages also depends on working days versus weekends, we split one week into 7 days and leverage the average time interval of the whole forum site on the current day.

s1, s2, s3, ..., s15: Ideally, we can estimate the update frequency of the current list by checking similar lists. To achieve this, we first represent the state of the current list via its last five time intervals and then cluster the states into 15 clusters based on their Euclidean distances. Each new list is assigned to the state with the smallest Euclidean distance. We represent the 15 states by a vector with 15 dimensions; for example, state 5 is represented by setting s5 = 1 and all others to zero.

where, given a time period Δt, Nindex is the number of new discussion threads arriving within Δt and NPost is the number of new posts arriving within Δt. We use the average historical ratio to balance the bandwidth between the two kinds of lists.
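One possible realization of this control is sketched below, under the assumption that the index-list share of a fixed daily page budget follows the historical thread-to-post ratio of Equation 3; the paper states only the ratio itself, so the exact allocation rule is an assumption.

```python
def split_bandwidth(daily_budget, new_threads_history, new_posts_history):
    """Sketch: allocate the page budget between index lists and post lists.

    The index-list share follows the average historical ratio of newly created
    threads to newly created posts, so index pages keep a guaranteed slice of
    bandwidth even when a few hot threads generate most of the posts.
    """
    total_threads = sum(new_threads_history)
    total_posts = max(1, sum(new_posts_history))       # avoid division by zero
    ratio = total_threads / total_posts                # Eq. (3): ratio ∝ N_index / N_post
    index_budget = int(round(daily_budget * ratio / (1.0 + ratio)))
    return index_budget, daily_budget - index_budget
```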

4. EXPERIMENTS
Different from previous work, which only considered revisiting existing pages, the scenario we consider is a real one: the crawler is required to crawl a target forum site starting from its portal page. For a thorough evaluation, a crawling task needs to last one year or even longer, so that we can measure the crawling performance with different metrics at different periods, for example the warming-up stage and the stable stage. This creates the need for a simulation platform to compare different algorithms, because real-time crawling cannot be repeated for multiple crawling approaches. Before describing the experimental details, we first introduce the experimental setup and the evaluation metrics.

4.1 Experimental Setup
To evaluate the performance of our system in various situations, 18 forums were selected from diverse categories (including bicycle, photography, travel, computer technique, and some general forums), as listed in Table 2. The average length of service of the 18 sites is about 4.08 years.

We wish to evaluate different methods over a long time period (about one year) under several different bandwidth settings, while still being able to repeat the evaluation in the same environment so that different approaches can be compared fairly. Because it is impractical to repeat a long-lasting crawling process many times on real sites, we built a simulation platform to facilitate the evaluation. Typically, a forum hosting server organizes forum data in a back-end database and dynamically generates forum pages from page templates.

Table 2: Web Forums in the Experiments

Id  Forum Site                     Description
1   www.avsforum.com               Audio and video
2   boards.cruisecritic.com        Cruise travel message
3   www.cybertechhelp.com          Computer help community
4   www.disboards.com              Disney trip planning
5   drupal.org                     Official website of Drupal
6   www.esportbike.com             Sportbike forum
7   web2.flyertalk.com             Frequent flyer community
8   www.gpsreview.net              GPS devices review
9   www.kawiforums.com             Motorcycle forum
10  www.pctools.com                Personal computer tools
11  www.photo.net                  Photography community
12  photography-on-the.net         Digital photography
13  forums.photographyreview.com   Photography review
14  www.phpbuilder.com             PHP programming
15  www.pocketgpsworld.com         GPS help, news and review
16  www.railpage.com.au            The Australian rail server
17  www.storm2k.org                Weather discussion forum
18  forum.virtualtourist.com       Travel and vacation advice

To build the simulation platform for a forum, we first mirror the forum site by downloading all its pages, then parse the pages in a reverse-engineering manner and store the parsed posts in a database. Thereafter, we can simulate the response to a download request by regenerating a forum page according to the requested URL and the crawling time.

More precisely, we first mirrored the 18 forum sites using a customized commercial search-engine crawler. The crawler was configured to be domain-limited and depth-unlimited. For each site, the crawler started from the portal page and followed any links within that domain, and each unique URL was followed only once. Consequently, the crawler mirrored all the pages of the 18 forum sites up to June 2008. The mirrored dataset contains 990,476 pages and 5,407,854 individual posts, from March 1999 to June 2008. Using manually written data extraction wrappers, we parsed all the pages and stored all the data records in a database.


The basic behavior of this simulation platform is to respond to a URL request associated with a timestamp. We wrote 18 page generators, one per forum site, to simulate the responses. For any requested URL, since all the corresponding records can be retrieved from the database, we can generate an HTML page with the same content, layout, and related links as the one on the real site. Here we make the assumption that a post is never deleted after it is created. Based on this simple yet reasonable assumption, we can determine, at any given time, whether a record exists (from its post time) and how the records are organized (from the forum's layout), and can therefore restore a snapshot of the forum site at that time.
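A minimal sketch of such a page generator is shown below; the database schema (a mapping from thread id to timestamped posts) and the page size are illustrative assumptions, not the actual 18 site-specific generators.

```python
def snapshot_thread_page(db_posts, thread_id, page_no, crawl_time, posts_per_page=20):
    """Sketch of the simulator: rebuild one post-of-thread page as of `crawl_time`.

    `db_posts` maps thread_id -> list of (post_time, post_content) tuples parsed
    from the mirrored site.  Because posts are assumed never to be deleted, the
    snapshot simply keeps the posts created before the requested crawl time and
    paginates them with the forum's page size.
    """
    visible = sorted(
        (t, body) for t, body in db_posts.get(thread_id, []) if t <= crawl_time
    )
    start = (page_no - 1) * posts_per_page
    page_posts = visible[start:start + posts_per_page]
    total_pages = max(1, -(-len(visible) // posts_per_page))   # ceiling division
    return {"thread": thread_id, "page": page_no,
            "total_pages": total_pages, "posts": page_posts}
```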

This simulation platform was used in all the following crawling experiments and provided a fair environment for every crawling approach. Assuming a fixed bandwidth, each crawler was required to crawl the given forum site starting from its portal page over the dummy time period from 2006-01-01 to 2007-01-01.

4.2 Methods Compared in the Experiments
To separate the advantage of the list-wise strategy from the benefit of bandwidth control, we implemented two variants of our method: (1) the list-wise strategy (LWS); and (2) the list-wise strategy with bandwidth control (LWS+BC).

Since the Curve-Fitting policy and the Bound-Based policy in [12] are state-of-the-art approaches and are the most relevant to our work, we also included them in the experiments. The original Bound-Based policy only crawls existing pages. We did our best to adapt these approaches to forum crawling by: 1) giving newly discovered URLs the highest weight; and 2) relaxing the interval condition for adjusting the refresh period and reference time, to accommodate the highly frequent updates in forum sites.

We also introduced an oracle strategy for comparison. In the oracle strategy, every update activity of every page in the target site is assumed to be known exactly. Given the fixed bandwidth, the oracle policy can always choose to visit the pages with the most new valuable information. The oracle strategy is an ideal policy and serves as an upper bound for the other crawlers.

4.3 Measurements
Following previous work, we assume that the costs of visiting different pages are equal, and we measure the bandwidth cost as the total number of pages crawled in a given time period [12]. To evaluate the overall crawling performance of each approach, we measure the following three aspects:

• Bandwidth Utilization. When the bandwidth is fixed, bandwidth utilization is an important measure of a crawler's capability. It captures whether the crawler makes the best use of a limited bandwidth:

B = Inew / IB    (4)

where IB is the bandwidth cost, defined as the total number of pages crawled in a time unit, and Inew is the number of crawled pages containing new information compared to the existing indexed repository.

• Coverage. A crawler needs to balance fetching new pages against refreshing existing pages. If a crawler wastes too much bandwidth refreshing previously downloaded pages, it may not be able to crawl all the newly generated pages, and vice versa. To measure this, we define coverage as follows:

Cov = Icrawl / Iall    (5)

where Iall measures the valuable information existing in the target site and Icrawl measures the valuable information already downloaded by the crawler.

• Timeliness. To measure whether each post is fetched in a timely manner, we introduce timeliness. Suppose we have downloaded N elements of valuable information. The timeliness is defined as:

T = (1/N) · Σ_{i=1}^{N} Δti    (6)

where Δti is the time period from the element's update time to its download time. If an element was updated three days ago and we downloaded it one day ago, Δti is two days. The smaller this value, the better.

In the forum crawling task, unlike on general web sites, once a post is submitted it exists until it is manually deleted by its creator or an administrator. In this paper, we therefore use the number of unique posts to measure the valuable information in forum sites. In the definition of coverage, Iall is the number of unique posts existing in the target forum site and Icrawl is the number of unique posts already downloaded. In the definition of timeliness, Δti is the time period from a post's creation time to its download time; if we download a post one day after it was created, Δti is one day.
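For concreteness, the three measures can be computed from a crawl log as sketched below; the log format (one record per fetched page with the unique post ids it contains) is an assumption made for this illustration.

```python
def evaluate_crawl(fetch_log, all_posts):
    """Sketch of the evaluation measures on forum data.

    `fetch_log` is a list of (fetch_time, post_ids_on_page) records in fetch order;
    `all_posts` maps post_id -> creation_time for every post existing on the site.
    Returns (bandwidth_utilization, coverage, timeliness).
    """
    downloaded = {}                       # post_id -> first download time
    pages_with_new = 0
    for fetch_time, post_ids in fetch_log:
        new_here = [p for p in post_ids if p not in downloaded]
        if new_here:
            pages_with_new += 1
        for p in new_here:
            downloaded[p] = fetch_time

    bandwidth_utilization = pages_with_new / max(1, len(fetch_log))               # Eq. (4)
    coverage = len(downloaded) / max(1, len(all_posts))                           # Eq. (5)
    timeliness = sum(downloaded[p] - all_posts[p] for p in downloaded) / max(1, len(downloaded))  # Eq. (6)
    return bandwidth_utilization, coverage, timeliness
```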

4.4 Warming Up Stage
To make the evaluation fair for all crawlers, we require every crawler to start from the portal page of each site with a fixed bandwidth of 3,000 pages per day. The crawlers begin crawling at the dummy time 2007-01-01 and run for about one year. To illustrate how the performance of each crawler changes over time, we calculate the average daily performance in terms of the three aforementioned measurements and present the results in Fig. 6.

The figure shows that the performance changes noticeably in the first 100 days and becomes stable after about 120 days. This is due to the so-called warming-up stage, during which a crawler first has to mirror the existing pages before it can concentrate on downloading new ones. Suppose there are Pold posts existing before the crawler starts, ΔP new posts are generated every day, and the bandwidth allows the crawler to fetch B posts per day. On the d-th day, there are about Pold + ΔP · d posts in the target site. At the beginning, since Pold ≫ ΔP · d, the crawler must download almost all the posts belonging to Pold, all of which are new valuable information compared to the indexed repository. This is why, in the first 100 days, the Bandwidth Utilization is close to 1 and the Coverage increases quickly, and why the Bandwidth Utilization then decreases quickly. We call this stage the warming-up stage of the crawler. After about Pold/(B − ΔP) days, the crawler has finished downloading all the old posts, focuses only on the ΔP posts generated each day, and its performance becomes stable.

Whatever refresh strategy a crawler chooses, as long as it assigns the highest weight to new valuable information, the length of the warming-up stage depends only on the number of old posts Pold, the new-post generation speed ΔP, and the bandwidth B.



Figure 6: One-year performance of different crawlers in (a) Bandwidth Utilization, (b) Coverage, and (c) Timeliness.

Furthermore, if the bandwidth B < ΔP, the bandwidth is too small even to cover the newly generated posts each day. In this case, the crawler will never be able to finish mirroring the old posts unless the forum site stops generating new posts.
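As a back-of-the-envelope illustration of this estimate, with hypothetical numbers that are not taken from the dataset:

```python
# Hypothetical illustration of the warming-up estimate P_old / (B - ΔP):
P_old = 300_000      # posts existing before the crawl starts (assumed)
delta_P = 500        # new posts generated per day (assumed)
B = 3_000            # bandwidth: posts the crawler can fetch per day (assumed)

warm_up_days = P_old / (B - delta_P)
print(round(warm_up_days))   # -> 120 days before the crawler catches up
```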

In general, the oracle method performs best on all measurements and serves as the ideal reference. Apart from the oracle, LWS+BC performs significantly better than the other methods in terms of timeliness. The average daily coverage of all methods reaches 100% after 100 days, because the bandwidth is sufficient for these methods to archive most historical records. However, their ability to fetch newly generated records differs, as shown by the timeliness results. For the given bandwidth, the timeliness of LWS+BC decreases after 100 days, because LWS+BC retains enough spare bandwidth to catch up with the new posts once it has finished crawling all existing posts. LWS only keeps its timeliness stable, because it has no additional bandwidth to catch up. The Bound-Based and Curve-Fitting policies achieve very similar performance; their timeliness keeps increasing (note that smaller is better for the timeliness measure) since they cannot fetch new posts in time and thus delay the downloading of new posts. We analyze them further in the next section.


Figure 7: The performance of different crawlers for different bandwidths in (a) Bandwidth Utilization, (b) Coverage, and (c) Timeliness.

4.5 Comparison with Different Methods
A crawler for a commercial search engine usually cannot be restarted frequently, so its performance after the warming-up stage is more meaningful. We evaluated all methods under various bandwidth conditions; all crawlers were required to start from the portal page of the given site at the dummy time 2006-01-01 and to run for about one year. We calculate the average performance of each method over the last 90 days, from 2006-10-01 to 2006-12-31, and present the results in Fig. 7.

The Curve-Fitting and Bound-Based policies perform similarly on timeliness and bandwidth utilization, while the Curve-Fitting policy performs slightly better on coverage. This is consistent with the original results in [12].

LWS is better than the Curve-Fitting and Bound-Based policies on all measurements, because the list-wise information allows us to estimate the update frequency more accurately. Furthermore, we avoid visiting all the pages in a list and thus save considerable bandwidth.

LWS+BC is the best policy and further improves the performance noticeably compared to LWS. Though the average coverage of the different methods appears very close (from 0.98 to 1.0), the actual gap is substantial when multiplied by 1 million pages or 5 million posts, and it may become even larger as we crawl more sites or as the bandwidth becomes more limited, as shown by the trends in Fig. 7(b).


Table 3: The performance differences between index lists and post lists. "BU" stands for Bandwidth Utilization, "C" for Coverage, and "T" for Timeliness.

                Index lists              Post lists
Methods     BU       C       T       BU       C       T
BB          0.0013   0.9925  171     0.0173   0.9917  170
CF          0.0012   0.9936  167     0.0174   0.9931  166
LWS         0.0025   0.9971  73      0.0276   0.9943  81
LWS+BC      0.0030   0.9989  28      0.0329   0.9949  65
Oracle      0.0061   1       2       0.0415   0.9990  16

Although LWS can estimate the update frequency of index lists and post lists relatively accurately, it is still very hard for it to balance these two kinds of pages. In contrast, the bandwidth control policy is more effective at balancing fetching valuable new data against refreshing downloaded pages. Such a policy only slightly affects post-of-thread pages but benefits index-of-thread pages a lot. When the bandwidth is set to 3,000 pages per day, the average timeliness of LWS+BC is about 65 minutes, while the average timeliness of the Bound-Based and Curve-Fitting policies is about 170 and 165 minutes, respectively. This shows that LWS+BC is 260% faster than the Bound-Based or Curve-Fitting policy and is thus capable of downloading new posts in a timely manner. Meanwhile, it also achieves a high coverage ratio compared to the other methods. To gain more insight into this behavior, we evaluate the two kinds of pages separately in the next section.

4.6 Index Lists and Post Lists
Given a fixed bandwidth of 3,000 pages per day, we evaluated the performance on index lists and post lists separately; the results are shown in Table 3.

From the table, we can see that LWS improves the performance for both kinds of pages. Compared to the Curve-Fitting and Bound-Based policies, the timeliness of both index-of-thread and post-of-thread pages decreases significantly in Table 3. This is because LWS leverages more information and can estimate the update frequency of both kinds of pages more accurately.

LWS+BC further improves the performance on index-of-thread pages compared to LWS: their timeliness decreases significantly in Table 3, while the timeliness of post-of-thread pages remains almost the same. This confirms our earlier assumption that bandwidth control can assign the right ratio according to the real numbers of updates for the different kinds of pages.

5. CONCLUSION
Realizing the importance of forum data and the challenges in forum crawling, in this paper we proposed a list-wise strategy for incremental crawling of web forums. Instead of treating each individual page independently, as is done in most existing methods, we made two improvements. First, we analyze each index list or post list as a whole by concatenating the multiple pages belonging to that list, and we then take into account user-behavior-related statistics such as the number of records and the timestamp of each record. Such information is of great help in developing an efficient re-crawl strategy. Second, we balance discovering new pages against refreshing existing pages by introducing a bandwidth control policy over index lists and post lists. To evaluate the proposed crawling strategy, we conducted extensive experiments on 18 forums, compared it with several

state-of-the-art methods, and evaluated it in various situations, including different stages of a one-year crawling simulation, different bandwidths, and different kinds of pages. Experimental results show that our method outperforms state-of-the-art methods in terms of bandwidth utilization, coverage, and timeliness in all situations. The new strategy is 260% faster than existing methods while also achieving a high coverage ratio.

6. REFERENCES
[1] R. Baeza-Yates, C. Castillo, M. Marin, and A. Rodriguez. Crawling a country: Better strategies than breadth-first for web page ordering. In Proc. of WWW, 2005.
[2] B. E. Brewington and G. Cybenko. How dynamic is the web? Computer Networks, 2000.
[3] B. E. Brewington and G. Cybenko. Keeping up with the changing web. Computer, 2000.
[4] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang. iRobot: An intelligent crawler for Web forums. In Proc. of WWW, pages 447–456, April 2008.
[5] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 1999.
[6] J. Cho and H. Garcia-Molina. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems, 28(4), 2003.
[7] E. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1, 1998.
[8] G. Cong, L. Wang, C.-Y. Lin, Y.-I. Song, and Y. Sun. Finding question-answer pairs from online forums. In Proc. of SIGIR, pages 467–474, July 2008.
[9] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proc. of VLDB, 2000.
[10] J. Edwards, K. S. McCurley, and J. A. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proc. of WWW, 2001.
[11] F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz. Evaluating topic-driven web crawlers. In Proc. of SIGIR, 2001.
[12] C. Olston and S. Pandey. Recrawl scheduling based on information longevity. In Proc. of WWW, pages 437–446, April 2008.
[13] S. Pandey and C. Olston. User-centric web crawling. In Proc. of WWW, 2005.
[14] M. L. A. Vidal, A. S. da Silva, E. S. de Moura, and J. M. B. Cavalcanti. Structure-driven crawler generation by example. In Proc. of SIGIR, 2006.
[15] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma. Exploring traversal strategy for Web forum crawling. In Proc. of SIGIR, pages 459–466, July 2008.
[16] J. Wolf, M. Squillante, P. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proc. of WWW, 2002.
[17] Y. Zhai and B. Liu. Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng., 18(12), 2006.
[18] S. Zheng, R. Song, J.-R. Wen, and D. Wu. Joint optimization of wrapper generation and template detection. In Proc. of SIGKDD, 2007.
