

    Computer Science Department

    Technical Report

    NWU-CS-05-08

    June 6, 2005

    RAIn: A System for Indexing and Archiving RSS Feeds

    Jeff Cousens and Brian Dennis

    Abstract

Really Simple Syndication, or RSS, provides a way for users to monitor a web site for changes. One of the most popular uses of RSS is to syndicate a web log. RAIn, for RSS Archiver and Indexer, is a system for monitoring and archiving RSS feeds, and for indexing their contents. This report provides a discussion of the design and implementation of RAIn. The report also includes a summary of RAIn's results over a two-week period, illustrating both how a small, low-end system is capable of monitoring a significant number of feeds and the types of statistics RAIn is capable of producing.


Keywords: Really Simple Syndication, RSS Feed Crawler, RSS Feed Statistics, RSS Feed Indexing, Python


    Table of Contents

Introduction
Overview
Designing a FeedCrawler
    Fetchers
        The FeedFetcher
        The CurlFetcher and SharedCurlFetcher
    Feed Newness
    Archiving
    Indexing
    Querying
        The FeedSearcher
        The FeedAnalyzer
    Feed Discovery
    Storage
        Schema
    Design Decisions and Lessons Learned
Experiences Using RAIn
    Experimental Setup
    Analysis
Conclusions
References


    Introduction

    The Internet is changing writing. Fifteen years ago, one had to convince a publisher to

    accept a manuscript before one could become a real author. Publications were limited to

    books, magazines and papers. Writing was something tangible, something that required

effort and overhead to produce and distribute. Then came the World Wide Web. Anyone could create a home page. Companies like Tripod and Geocities enabled everyone to get

    web space for free and publish anything they wanted. There was no longer any editorial

    approval or need to sell copies.

    Recently, interest in publishing on the Web has led to an explosion in the popularity of

    web logs, or blogs. Blogs make it easy to maintain repeatedly updated sites. People now

are creating electronic diaries for the entire world to read. Where once everyone had a home page, now everyone has a blog. Some authors post infrequently and personally,

    while others take their blogs very seriously and professionally. In 2004, blogs played a

    significant role in the US Presidential Election. At both the Democratic and the

Republican National Conventions, bloggers stood beside traditional journalists. Providing real-time coverage via wireless devices, these blogs were the main source of coverage of

    the conventions for many people [8].

The blogging phenomenon is still relatively young. While some sites exist to gather blog statistics, they often keep the information close, only revealing a very small subset of the

    information gathered. There are millions of blogs out there [14], and little is known about

    them. How do people use blogs? What are their posting habits? More interesting are the

stories told by these blogs. What hot news topic is everyone discussing? When did they start?

RAIn was created to help answer these questions. RAIn is a software system capable of collecting information about hundreds of thousands of blogs, allowing us to examine the behavior of a large community of blogs. The system was designed to analyze the

    behavior of communities of feeds, not blogging on the whole. While capable of handling

hundreds of thousands of blogs, it was not designed to compete with sites like Technorati [15] or Syndic8 [2] that attempt to perform exhaustive monitoring of every blog in the

blogosphere, the world of web logs. The system is fairly lightweight, permitting it to be

    run on commodity hardware; modular, enabling components to be changed or extended;

    and flexible enough to be adapted to a wide variety of queries.

    Overview

Really Simple Syndication, or RSS, is a lightweight XML-based method for sharing web

    content. It provides a low-bandwidth way for users to watch a web site for changes. As

blogging has exploded in popularity, so has RSS. All major blogging packages include support for syndication using RSS. RSS comes in two common versions: RSS 0.9x [10]


and RSS 2.0.x [17]. Depending upon the implementation, an RSS feed may contain anything from a list of headlines with brief summaries to the full contents of a blog's articles. A blog's RSS 2.0.1 feed might look like:

    <rss version="2.0">
      <channel>
        <title>Technology at Harvard Law</title>
        <link>http://blogs.law.harvard.edu/tech/</link>
        <description>Internet technology hosted by Berkman Center.</description>
        <pubDate>Tue, 04 Jan 2005 04:00:00 GMT</pubDate>
        <item>
          <title>RSS Usage Skyrockets in the U.S.</title>
          <link>http://blogs.law.harvard.edu/tech/2005/01/04#a821</link>
          <description>Six million Americans get news and information from RSS
          aggregators, according to a nationwide telephone survey conducted by
          the Pew Internet and American Life Project in November.</description>
          <author>Rogers Cadenhead</author>
        </item>
      </channel>
    </rss>

Minimally, an RSS feed is a sequence of loosely structured items. RSS's ease of use and popularity has led to a syndication ecology, where readers monitor a site's RSS feed instead of the site itself.

With RAIn, the goal was to create a system for monitoring, archiving and analyzing RSS feeds. The system is designed to be modular, permitting new components to be added or various components to be changed out. This is true both in the design of the objects and packages and in the way that RAIn uses Python [16], the high-level, interpreted, object-oriented language RAIn was written in. The system is lightweight, requiring only inexpensive hardware to monitor hundreds of thousands of feeds on a daily basis.

    Designing a FeedCrawler

RAIn is a modular system, consisting of an engine to manage feeds to be crawled and modular components to fetch the feeds, archive the results and index the contents, as well as components for answering queries and finding new feeds.


    Figure 1: A diagram of RAIn's architecture

Figure 1 shows RAIn's architecture. The crawler determines which feeds need to be crawled, and creates a fetch thunk, an executable object containing a fetcher, an archiver, an indexer and connections to the Internet and database. The fetch thunks are then put in a crawl pool. As threads become available, fetch thunks in the crawl pool are executed. The fetch thunk thread fetches the feed from the Internet and, if updated, archives and indexes it.
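To make the thunk pattern concrete, the following is a minimal sketch of how such a crawl pool might be wired together; the class and function names (FetchThunk, worker) and the fetcher/archiver/indexer interfaces are illustrative rather than RAIn's own:

    import Queue
    import threading

    class FetchThunk:
        """Bundles everything one crawl needs so a worker can simply call it."""
        def __init__(self, feed_url, fetcher, archiver, indexer, db):
            self.feed_url = feed_url
            self.fetcher = fetcher
            self.archiver = archiver
            self.indexer = indexer
            self.db = db

        def __call__(self):
            result = self.fetcher.fetch(self.feed_url)
            if result.updated:                  # only changed feeds are stored
                self.archiver.archive(result)
                self.indexer.index(result)
            self.db.record_fetch(self.feed_url, result.status)

    def worker(pool):
        while True:
            thunk = pool.get()                  # blocks until a thunk is queued
            thunk()

    pool = Queue.Queue()
    for _ in range(30):                         # 30 worker threads, as in the experiment
        t = threading.Thread(target=worker, args=(pool,))
        t.setDaemon(True)
        t.start()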

Every feed monitored by RAIn is stored in a database table. This table includes a variety of information about the feed, including the URL to check, the time and result of the last check, a status count, the time of the next check and information about the last fetch of the feed, including the HTTP ETag and Last-Modified headers for determining feed newness, if available, and an MD5 digest of the feed. On a user-defined interval, RAIn checks to see if the database contains feeds that have yet to be checked or stale feeds. Feeds are considered stale when the next check time has passed. URLs of feeds to be checked are retrieved by the core module and dispatched to worker threads.

    Fetchers

    A Fetcher handles the HTTP retrieval and processing of a feed. It is responsible for

    determining whether the feed has changed, updating the status of the feed and the feed

    metadata, and archiving and indexing the feed, if necessary. Three different classes of

    Fetchers were created:


    The FeedFetcher

The FeedFetcher is the most basic Fetcher module. In addition to a simple set of Fetcher routines, the FeedFetcher module also contains routines common to all Fetchers. It is written using only stock Python routines and uses Python's urllib2 module to retrieve the feeds. This allows the FeedFetcher module to be used on any system where Python is available without any dependence on non-standard modules that might not be available on all platforms and versions. It contains a reasonable amount of intelligence to attempt to avoid overloading servers. However, the urllib2 handles are not reusable, so every feed crawled requires a new handle to be instantiated.
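As a rough sketch of what a urllib2-based retrieval looks like (the function name, return shape and User-Agent string are assumptions, not the FeedFetcher's actual interface):

    import urllib2

    def fetch(url):
        # urllib2 handles are not reusable, so each fetch builds a new one
        request = urllib2.Request(url, headers={'User-Agent': 'RAIn'})
        response = urllib2.urlopen(request)
        body = response.read()
        headers = response.info()       # ETag, Last-Modified, etc.
        return body, headers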

    The CurlFetcher and SharedCurlFetcher

The CurlFetcher is an enhanced Fetcher module. It uses pycurl [6], a Python wrapper to libcurl. libcurl is a highly optimized C library for network operations. It implements features like caching the results of Domain Name System (DNS) queries and reusable handles, allowing for improved performance over Python's urllib2. The CurlFetcher can be used in two different ways: per fetch (CurlFetcher) or per thread (SharedCurlFetcher). Per fetch, the CurlFetcher is similar to the FeedFetcher in that every feed crawled requires a new handle to be instantiated. Per thread, the SharedCurlFetcher creates one handle per thread when the FeedCrawler module is started. This saves the Fetcher the overhead of having to instantiate a new pycurl handle every time a new feed is fetched. This also allows the pycurl handle to cache DNS information across fetches, reducing network overhead.
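A sketch of the per-thread handle reuse that the SharedCurlFetcher provides might look like the following; the use of threading.local here is illustrative (the report's version created its handles when the FeedCrawler started), but the pycurl calls are the standard ones:

    import threading
    import StringIO
    import pycurl

    _local = threading.local()

    def fetch(url):
        if not hasattr(_local, 'curl'):
            _local.curl = pycurl.Curl()             # one reusable handle per thread
        buf = StringIO.StringIO()
        _local.curl.setopt(pycurl.URL, url)
        _local.curl.setopt(pycurl.WRITEFUNCTION, buf.write)
        _local.curl.perform()                       # DNS cache survives across fetches
        return buf.getvalue()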

The performance differences between the FeedFetcher, CurlFetcher and SharedCurlFetcher were only briefly examined. As might be expected, the SharedCurlFetcher had the best performance in terms of number of feeds fetched per minute. There was not an obvious winner between the FeedFetcher and the CurlFetcher.

    Feed Newness

Once a feed is retrieved, all Fetchers check to see if the feed has changed using several different metrics. If the HTTP headers contain an ETag field and the ETag field matches the previous ETag, or if the header contains a Last-Modified field and the Last-Modified field matches the previous Last-Modified, the feed is considered unchanged. If the feed is not found to be unchanged by ETag or Last-Modified header, an MD5 digest is taken of the entire feed. This is compared against a previous MD5 digest. If the digests are different, the feed is considered updated. Updated feeds are archived, both in their raw format for potential future analysis and as individual items.
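Condensed into code, the newness test amounts to something like the following sketch, where `previous` stands for the values saved in the feed's database row from the last fetch (the function and key names are illustrative):

    import md5      # Python 2.3-era module; hashlib.md5 in later versions

    def feed_is_updated(headers, body, previous):
        etag = headers.get('ETag')
        if etag and etag == previous['http_etag']:
            return False                        # unchanged, by ETag
        modified = headers.get('Last-Modified')
        if modified and modified == previous['http_last_modified']:
            return False                        # unchanged, by Last-Modified
        digest = md5.new(body).hexdigest()
        return digest != previous['fetch_digest']   # changed iff digests differ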


Frequency of fetching is adaptive, in an attempt to match the feed's change frequency. The next fetch time is determined by adding a fetch interval to the time the current fetch was performed. All feeds begin with an interval of 1 hour, which is then modified based upon the result of the current fetch. If the feed has changed since the last check, the fetch interval is reduced by 2 hours, to a minimum of 1 hour. If the feed was unchanged, or there was an error fetching the feed, the fetch interval is increased by 4 hours, to a maximum of 24 hours. The number of errors that occur when fetching a feed is recorded; after too many consecutive failures, feeds are marked as removed and no longer checked.
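The interval-adjustment rule can be written directly; all of the constants below come from the text, while the function name is an illustrative stand-in:

    HOUR = 3600
    MIN_INTERVAL = 1 * HOUR                 # every feed starts here
    MAX_INTERVAL = 24 * HOUR

    def next_interval(interval, changed, error):
        if changed and not error:
            return max(interval - 2 * HOUR, MIN_INTERVAL)   # check sooner
        return min(interval + 4 * HOUR, MAX_INTERVAL)       # back off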

Adapting to a feed's frequency of change helps RAIn discover new items as they are posted without overloading a server by repeatedly checking for new items when a feed is unchanged. A rudimentary analysis showed that adaptive fetching worked: the fetch interval grew larger for infrequently updated feeds while it stayed small for frequently updated feeds. However, a more in-depth analysis over a longer time period would be necessary to determine how effective RAIn's current implementation is.

    Archiving

When a feed is determined to be new, the raw feed is archived in a database table. The feed is compressed using Python's zlib module. This is important as feeds, being text, compress to somewhere between 5 and 10% of their original size. The compressed feed is then inserted into a binary database field, along with the date that the feed was archived. This raw feed can then be retrieved and uncompressed for later analysis.

The feed is also parsed using Mark Pilgrim's Universal Feed Parser [12] into individual entries in the feed. The items are then individually checked against the database using an MD5 digest to filter out items that have already been processed. New items are serialized using Python's pickle module, compressed using Python's zlib module and stored in a binary database field. These items can be retrieved, uncompressed and unserialized to access the raw contents of the item, including the full text of the item.
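In outline, the archiving path looks something like this sketch; the SQL layer is elided, and the exact bytes hashed for the item digest are an assumption (the report does not say):

    import md5
    import pickle
    import zlib

    def archive_feed(raw_feed):
        # text feeds typically compress to 5-10% of their original size
        return zlib.compress(raw_feed)

    def archive_item(item):
        digest = md5.new(repr(item)).hexdigest()    # duplicate-filter key
        blob = zlib.compress(pickle.dumps(item))    # serialized, then compressed
        return digest, blob

    def restore_item(blob):
        # reverse the process to recover the full item, text and all
        return pickle.loads(zlib.decompress(blob))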

Information about the items contained within the feed is stored, including the URL of the item, the time that the item was posted, if available, the time that the item was archived and an MD5 digest of the item for later comparison. This information can be used to look for items from a certain site or in a certain date range. It can also be used to retrieve the item referenced in the feed from the feed's web site.

    Indexing

In order to facilitate querying and analyzing the content of the feeds, the words in the feeds are indexed. This is a daunting task. Assuming the crawler is able to retrieve 200,000 feeds per day, that only 50% of these feeds contain new items and that the average item contains 20 words, the crawler will store 2,000,000 words per day. At that


pace, in less than two months the crawler will have indexed more words than are contained in the British National Corpus. In practice, the numbers are significantly higher.

The design for indexing is based upon the method used by the WordHoard project at Northwestern University [11]. One aspect of WordHoard is an interface that allows literary scholars to search a corpus of works for words and to generate statistics, including frequency and counts. This is interesting to literary scholars as it can reveal patterns in an author's works as well as trends in literature on the whole. In WordHoard, works of literature are parsed into individual words. The words are stored in two different tables: a table of individual lemmas, for linguistic analysis across a corpus, and a table of word occurrences, for analyzing specific instances of a word's usage. The word occurrence table stores complete information about every word and punctuation mark in every work in a corpus and, with some meta-information such as speakers and act/scene or page, may be used to reconstruct the entire work, word for word.

Using WordHoard as a model, RAIn splits an RSS feed into individual items, then parses each item into individual words. Here RAIn departs slightly from WordHoard's word occurrence model. With RAIn retrieving more than 2,000,000 new words per day, storing complete word occurrence information for any significant time interval would consume an unreasonable amount of storage. As a compromise, sentences are stripped of punctuation and filtered against a list of common stop words. The remaining words are then aggregated within an item and the distinct filtered words and their counts are stored in a database.
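A minimal sketch of this aggregation step follows; the tokenizer and the stop list shown here are illustrative stand-ins for RAIn's own:

    import re

    STOP_WORDS = set(['the', 'a', 'an', 'and', 'of', 'to', 'in'])  # sample only

    def index_item(text):
        # strip punctuation, lowercase, and split on whitespace
        words = re.sub(r'[^\w\s]', ' ', text.lower()).split()
        counts = {}
        for word in words:
            if word not in STOP_WORDS:
                counts[word] = counts.get(word, 0) + 1
        return counts       # one database row per distinct word and its count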

While words can tell a story with what they say, URLs define relationships. They show how posts relate to other posts and how sites relate to other sites. Special attention is paid to URLs in order to track these relationships. When indexing, RAIn looks for common URL patterns and stores them in a separate table. For this purpose, a URL is considered to be either a string containing the pattern http:// or the contents of an HTML anchor element's href attribute.
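In code, the two patterns the text names might be captured with regular expressions along these lines (the exact expressions are illustrative):

    import re

    # the href attribute of an HTML anchor element
    HREF_RE = re.compile(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', re.IGNORECASE)
    # any bare string containing the pattern http://
    HTTP_RE = re.compile(r'http://[^\s"\'<>]+')

    def extract_urls(body):
        urls = set(HREF_RE.findall(body))
        urls.update(HTTP_RE.findall(body))
        return urls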

Using the information in these tables, many different types of queries are possible. One family of queries is that providing general statistical information: e.g., how long is the average item, or what is the average ratio of URLs to words. Blogging is still young and not much is known about the posting habits of bloggers. Are more people verbose but infrequent posters, or brief but frequent posters? Does time of day impact posting? How about day of week? These statistics begin to paint a picture.

With the right constraints, these statistics can even tell a story. For example, by monitoring a collection of political blogs before, during and after a keystone event (e.g., a party's national convention or the State of the Union), one might be able to tell whether the event was motivating, discouraging, or even mostly ignored.

More interesting are informational queries: e.g., how did the frequency of a word change over a given period of time, or what are the most commonly used words. A political


scientist might wonder how the usage of "Schiavo"¹ changed during the period from February through April of 2005. When did people start posting about her? When did they stop? A linguist might wonder what the most commonly used words are, and how this changes over time. Where 18th century activists wrote books, many 21st century activists write blogs. Instead of being stored on paper, the snapshot of society today's authors provide us with is online.

    Many blogs now make RSS feeds available for comments as well. By monitoring

    comments, one can gauge reader response to certain topics. Which posts engendered the

    most comments? Which were largely ignored?

    Querying

In order to facilitate analysis, a few different Python modules were created. These modules use two different approaches. The first is to provide a very generalized module, which can accommodate a wide variety of queries based upon RAIn's indexing. The second is to provide a very specific module, only capable of providing a focused set of information but able to generate it in an optimized way.

    The FeedSearcher

The FeedSearcher module provides a generalized interface to the RAIn database. It allows someone to retrieve a list of items, words or URLs based on a series of constraints, including date, word count, a set of feeds to search and a pattern, either an exact pattern, a substring or a regular expression. It even allows someone to limit the number of results, to specify an offset and to get more information about a result, tying a word or URL to an item, and an item to a feed. The following is an example of using the FeedSearcher to find the number of updates per day for a single day:

    import FeedSearcher

    fs = FeedSearcher.FeedSearcher('localhost', 'db', 'user', 'pass')
    fs.type = 'entries'
    fs.start_date = '2005-03-15'
    fs.end_date = '2005-03-15'
    results = {}
    for item in fs.execute():
        for details in fs.getDetails('entries', item['feed_id']):
            if results.has_key(details['feed_url']):
                results[details['feed_url']] += 1
            else:
                results[details['feed_url']] = 1

¹ Terri Schiavo was a Florida woman in a persistent vegetative state whose right-to-life vs. right-to-die case became national news in March 2005 [4].


This generates a Python hash table, or dictionary, using the feed URLs as keys with the number of times the feed was updated that day as values. Yet in order to achieve the flexibility of the FeedSearcher, the interface is generic. The FeedSearcher's execute method does not return enough information, so the getDetails method must be invoked on every item returned by execute. Each call to getDetails involves a SQL query. Thus, in order to compute this statistic, the total number of SQL queries involved is the number of items plus one. For any large data set or complex query, a more optimized interface is desired.

    The FeedAnalyzer

The FeedAnalyzer provides a very specific interface to the RAIn database. It is only capable of performing a fixed set of queries, but it performs them well and does so much better than the FeedSearcher could. The FeedAnalyzer was written for this report, and generated the statistics presented in the empirical study. It was designed from the top down, first looking at the information to present, then determining what queries were necessary to generate that information. The queries were optimized for the task, which provided faster query execution times and reduced the overall number of queries, while the results were returned in the format required for importing into Excel for analysis. The following is an example of using the FeedAnalyzer to find the number of updates per day for a range of days:

    import FeedAnalyzer
    import mx.DateTime

    fa = FeedAnalyzer.FeedAnalyzer(
        'localhost', 'db', 'user', 'pass',
        mx.DateTime.ISO.ParseDateTime('2005-03-15'),
        mx.DateTime.ISO.ParseDateTime('2005-03-31'))
    result = fa.updatesPerDay()

As with the FeedSearcher, this also generates a Python dictionary using the feed URLs as keys with the number of times the feed was updated that day as values. However, the FeedAnalyzer generates this information using both fewer lines of Python and only one SQL query.

    Feed Discovery

For some data sets, it is necessary to analyze a specific collection of feeds. However, sometimes all that is desired is a large collection of feeds. The FeedFinder module is designed to find new feeds to crawl. It is capable of visiting a blog tracking web site, such as blo.gs [18], and retrieving a list of RSS feed URLs. This list is then parsed by the FeedFinder and checked against the database (and itself) for duplicate feed URLs. New feeds are then added to the database for the FeedCrawler to crawl. This module was integrated into the FeedCrawler, although it may also be run independently.
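The duplicate filtering itself is straightforward; a sketch, with illustrative names:

    def filter_new_feeds(candidate_urls, known_urls):
        seen = set(known_urls)          # URLs already in the webfeeds table
        fresh = []
        for url in candidate_urls:
            if url not in seen:
                seen.add(url)           # also catches duplicates within the batch
                fresh.append(url)
        return fresh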


    Storage

RAIn currently stores all of its data in a relational database. It leverages Python's DB-API, allowing the database to be fairly easily swapped out. Modifications are necessary only when the schema changes due to differences in data types (e.g., PostgreSQL's bytea vs. MySQL's longblob) or to accommodate differences in modules' support of data types (e.g., pyPgSQL's PgSQL.PgBytea versus psycopg's Binary).

The database for RAIn is PostgreSQL [13]. Initially, pyPgSQL was used as the Python DB-API interface to PostgreSQL, and PostgreSQL performed well. However, as the database grew in size, performance fell off. At one point, RAIn was only processing thousands of feeds per day. To improve performance, the database interface module was switched from pyPgSQL to psycopg. While both modules provide a DB-API 2.0-compliant interface, psycopg was designed to be much faster than other modules. One important difference between pyPgSQL and psycopg is a bug with psycopg 1 and Unicode characters. This required a workaround to handle potential Unicode data. In addition, psycopg required more attention to be paid to transactions so updates would be available across all database handles.

    Schema

RAIn's database consists of six tables:

webfeeds: This table contains information about the RSS feeds being monitored by RAIn. It is updated every time a feed is crawled with information about the fetch operation. If the feed is changed, http_etag, http_last_modified, fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count, fetch_interval and fetch_digest are updated. If the feed has not changed since the last fetch, only fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count and fetch_interval are updated, as the ETag, Last-Modified and MD5 digest will be unchanged.

webfeeds_archive: This table contains a zlib-compressed archive of every new RSS feed RAIn fetches, stored as a binary object (e.g., PostgreSQL's bytea, MySQL's longblob) in data_bytes. The date that the feed was archived is stored in data_archived. This allows the data to be analyzed using a different methodology at a later date.

webfeed_items: This table contains information about the individual items in the feeds fetched by RAIn.

webfeed_item_words: This table contains all of the words found by the FeedIndexer. The count for each word per item is stored in word_count. To facilitate querying, a ts_vector for the word is stored in index_word.


webfeed_item_urls: This table contains all of the URLs found by the FeedIndexer. To facilitate querying, a ts_vector for the URL is stored in index_url.

webfeed_bundles: This table contains information about bundles of RSS feeds, representing a many-to-one relationship between a group of feeds and a bundle name. This relationship allows groups of feeds to be tied together into a single bundle for analysis.

Figure 2: An entity-relationship diagram for RAIn's database


In addition to feed information stored in the database, RAIn also stores low-level information in log files. The logging level may be configured from critical, logging only events that would prevent RAIn from running, to debug, logging almost every operation RAIn performs.

    Design Decisions and Lessons Learned

Storage: By itself, an RSS feed does not represent a significant amount of data; typically only a couple of KB. However, when processing 200,000 feeds per day, 75% of which are updated at least once a day, that couple of KB adds up very quickly. Space rapidly became a concern during development and we had to make several changes as a result.

Most important was the implementation of accurate duplicate elimination. The initial design did not fully incorporate MD5 checksums. This was improved so that both the feed itself and the items within the feed are now checksummed. The feed is checked against the last feed to see if it has changed, while the item is checked against all items from that feed to ensure that it has not already been processed. Adding both of these checksums made a significant difference: feed checksums doubled the number of feeds marked as unchanged, while item checksums reduced the number of items processed by more than 75%. Not only did these changes directly correspond to savings in storage, but they also allowed significant increases in the number of feeds crawled per day. However, even with these reductions, the amount of data stored is still very significant.

Indexing: While indexing is a very simple process to implement, it is very difficult to make it work well. The average post contains 350 distinct words. The architecture of the word index, while very useful for statistical analyses, requires that each of these words be its own record. This means that every feed crawled requires an average of 355 database inserts. This is a very significant amount of database I/O and comes at a non-negligible cost.

Stop wording is one approach to reducing the amount of data. While initially a basic set of stop words was used, during analysis it was discovered that additional stop words are necessary. Depending upon the data set being analyzed, it may be necessary to analyze a few weeks' worth of data to get an adequate feel for which words are important and which are not. It is also important to consider the goal in using stop words. Some common stop words may actually be useful for answering certain questions; e.g., to analyze whether posts about men or women are more common, it would be desirable to have he/she and his/hers in the database.

The size of the database also impacts performance. At 350 words per item and 100,000 new items per day, the words table would accumulate more than 1 billion words in less than a month. Even with stop words and duplicate elimination, the size of the words table is very substantial. This poses its own set of concerns when analyzing or updating the database. It places constraints on database performance, file system usage and even database design and index usage. As the database grows, queries take longer to process.


    Past a certain point, queries may no longer be performed interactively. Certain types of

    queries eventually become impossible.

One solution to this problem is to change the database design. Currently, all of the words are stored in a single table. Switching to a design where a new table is created every day would limit the size of the individual tables, keeping search times interactive. Some data sets (e.g., most popular words) could be precomputed and the results stored in a separate, much smaller table.

Another approach would be only to store frequency information in the database and not to maintain a full-text index. The items could be stored as XML and indexed using a different mechanism, such as Lucene [1], Nutch [7] or XTF [5].

Database Design: Relational databases can provide a great deal of power and allow you to perform some very complicated queries very easily. However, as the size of the database increases, greater attention must be paid to the design of the database and how it impacts performance.

The initial design of the stale feeds query used a not-equals constraint to filter removed feeds:

    SELECT feed_url, http_etag, http_last_modified,
           fetch_next_attempt, fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt AND fetch_status != 16
    ORDER BY fetch_next_attempt LIMIT 200;

PostgreSQL executed this constraint as a sequence scan, linear with the number of records, despite the presence of an index on the feed status. This was not a problem with a small database, but, as the number of feeds monitored increased, so did the time it took for the stale feeds query to execute. This had an impact on the crawler's performance, as a significant amount of processor time and database I/O was lost waiting for this query to finish. To improve performance, the feed removal process was redesigned. Feeds continue to be marked as removed, but the next fetch time is set for 1,000 years in the future. This allows use of a simple date filter, as it is unlikely that RAIn will be used with a current date of 3005. Since the next fetch field had previously been left untouched once a feed was removed (providing a timestamp for when the feed was removed), a new field was added to record when a feed is removed.
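The report does not print the revised query, but given the redesign described above, the stale-feeds query reduces to a plain date filter, roughly:

    SELECT feed_url, http_etag, http_last_modified,
           fetch_next_attempt, fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt
    ORDER BY fetch_next_attempt LIMIT 200;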

Relational databases allow the creation of constraints on data, enforcing a schema and ensuring data integrity. This protection comes at a cost. The initial design included referential integrity constraints connecting the various tables, ensuring that a webfeed_item and webfeed_archive had a corresponding webfeed, and that a webfeed_item_url and webfeed_item_word had a corresponding webfeed item. At one point in development, it was necessary to perform some deletions from the database to remove duplicate feeds. With referential integrity constraints present, the database needed


to perform complex joins and scans in order to process the deletion. Even with indexes in place, the queries took a significant amount of time. At that point, the decision to include referential integrity constraints was reevaluated and the constraints were removed, instead trusting RAIn to accurately insert data.

Data Format and Encoding: One problem when crawling a disparate set of web data is that it comes in a wide variety of formats. This makes it difficult to predict what data format a feed will use. Two places this caused problems were in the character encoding and timestamp format. Most RSS feeds use either ASCII, ISO-8859-1 or UTF-8 encoding, but some use other encodings. Similarly, most HTTP servers use a standard timestamp format that can be handled by the eGenix mxDateTime library [3], but some use uncommon formats not handled by mxDateTime; e.g., 2005-04-28T12:13:49Z. With feed finding enabled, it is likely that the database will eventually include some feeds using unhandled formats. As a result, careful attention must be paid to exception handling. On the data set RAIn was tested with, this happens about 0.005% of the time.

    Experiences Using RAIn

A 16-day period at the end of March was selected to empirically investigate RAIn's capabilities. Several hand-selected groups of RSS feeds were added to the database and monitored, along with approximately 77,000 existing feeds. When the period ended, the data from those feeds was exported and analyzed for several different statistics.

    Experimental Setup

For this experiment, RAIn was run on a dual Pentium III at 933 MHz with 1 GB RAM running Debian GNU/Linux testing (sarge) with the Debian versions of PostgreSQL (7.4.7-2), Python (2.3.5-1), psycopg (1.1.18-1), pycurl (7.13.0-1) and libcurl3 (7.13.1-2). The OS and crawler were stored on an internal Ultra-160 SCSI disk. The database was stored on an external Ultra Wide SCSI attached RAID5 ATA-100 disk array. The server was connected to the Internet via a 100Mb Ethernet connection and configured to use a local installation of the DeleGate proxy server (8.9.6) with caching enabled.

In order for analysis to produce meaningful results, the feeds analyzed should be representative of something. To that end, six bundles of feeds were defined for analysis, totaling 723 feeds. These bundles are Computers & Technology (150 feeds), Entertainment (132 feeds), Eszter Politics (107 feeds), Politics (232 feeds), Sports (52 feeds) and Subscriptions A-list (50 feeds). Four of these (Computers & Technology, Entertainment, Politics and Sports) are the feeds from the Weblogs categories of Yahoo's Directory. Eszter Politics is a handpicked collection of political feeds from a colleague in Northwestern's School of Communications who is studying political blogging. Subscriptions A-list is the intersection of several different "top feeds" lists. This allows us


to examine the behavior of specific blogging communities (e.g., the political community or the entertainment community) as well as to compare and contrast different communities.

The database was not purged before data gathering. In addition to the 723 bundled feeds, there were 77,373 feeds that were not part of any bundle. These feeds were obtained from several different sources, including the list of recently updated feeds on blo.gs and the list of syndicated feeds on Syndic8, two web sites providing centralized monitoring of over 10 million blogs. Some of the statistics analyzed are representative of the database as a whole, including both bundled and unbundled feeds. The unbundled feeds are not categorized in any way, other than having been on lists of blogs that were validated and updated recently. As such, they are representative of blogging on the whole and not any particular field. This is certainly a useful collection of feeds to consider. However, sites such as Technorati monitor 10 million blogs and the number of blogs in existence has been estimated to be over 50 million [14], so this is a very small slice of all blogs.

Analysis was performed while RAIn was still running, using the live database. For some queries, the live tables could be used. However, for others, especially those involving words, the live tables are too large to analyze in a reasonable amount of time. For the purpose of analysis, all of the data for the window being analyzed was exported into separate tables. This reduced the size of the words tables by more than 99%, making analyses possible in a reasonable amount of time. Even with these separate, smaller tables, some of the queries took more than 10 minutes to perform, while the queries to create these tables took several hours to complete.

For the duration of our data gathering, RAIn was checking for stale feeds every 60 seconds and looking for a maximum of 400 stale feeds. 30 threads were available in the thread pool. The hardware was capable of supporting a higher number of threads, and thus processing a larger number of stale feeds per minute, but the number of threads was intentionally kept low to facilitate simultaneous crawling and querying. RAIn was constantly busy on a feed set of 78,096 feeds, making approximately 235,000 feed visits per day.

    Analysis

We selected March 15th through March 31st, 2005, as our window to analyze. Unfortunately, there was an unexplained glitch and the kernel killed the crawler early on March 23rd. The crawler was restarted on the 25th, but there was some information missed as a result of this failure, and certain aspects of the results are atypical. This has been indicated where it affects the results.


    Figure 3: Feed status per day

    Figure 4: Disk usage per day

    Figure 3 shows the performance of the crawler in terms of how many feeds the crawler

    was able to visit each day, along with a breakdown of how the feeds were classified.


Figure 4 shows the size, in KB, that the database grew each day storing the information about these feeds. It is important to note that these numbers pertain only to feeds, but database usage increases in proportion to both the number of updated feeds and the number of new items in those feeds. The number of new items per feed was unusually high on March 25th due to the performance problems on the previous days; thus the disk-per-feed ratio on that day is not representative of typical usage.

These numbers were obtained from RAIn's logs and are representative of the performance of the system as a whole, including both bundled and unbundled feeds. From them, one can obtain a feel for RAIn's performance. From the total number of feeds, one can estimate that every feed in the database was visited approximately 3 times per day. The high numbers of updated feeds, coupled with RAIn's constant activity, hint that RAIn was not able to catch updates as they happened, instead catching them hours after they occurred. The high numbers of unchanged feeds hint that the cap of 24 hours for the fetch interval may need to be increased.

These numbers also highlight the glitch that resulted in the crawler being killed by the kernel, as seen by a marked difference in performance beginning on March 21st.

    Figure 5: The number of updates per day per bundle


Figure 5 shows the average number of updates per day per bundle, normalized against the number of feeds in the bundle.² One interesting result is that the Subscriptions A-list bundle has a much higher number of posts per day than the other bundles. Membership in the Subscriptions A-list bundle is based roughly on popularity, not number of updates. From this, it is possible to conclude that popular blogs are updated more frequently than other blogs. Another surprising result is that the number of posts per day for the Entertainment bundle is so low. In this case, it can be concluded that either entertainment blogs do not post all that frequently or the Entertainment bundle contained blogs that were not being updated during the analysis window. All of the numbers are slightly low due to the complications around the 24th, but they still demonstrate the relative frequencies between bundles.

    Figure 6: The number of updates per day of week per bundle

Figure 6 shows the average number of updates per bundle against the day of the week, normalized against the number of feeds in the bundle.² Again, the numbers are slightly low due to the complications around the 24th. Unlike the previous graph, the relative frequencies are affected by the complications around the 24th: the numbers for Saturday and Sunday are unaffected, the numbers for Monday and Friday only slightly affected, and the numbers for Tuesday, Wednesday and Thursday significantly affected. Looking at the frequencies, it appears that there may be an interesting, if perhaps predictable, story about updates and day of week that should be reexamined.

² Normalization is performed by dividing the number of updates by the number of feeds in the bundle.


    Figure 7: Date vs. time vs. frequency showing only the Sports bundle

Figure 8: Date vs. time vs. frequency showing only the Subscriptions A-list bundle


Figures 7 and 8 show the post density against date and time. The size of a bubble is representative of the number of posts in a ten-minute window. These graphs give an idea of the posting habits of a bundle. On the whole, posting is steady, but there are definite peaks and valleys for some bundles. For example, post density for the Subscriptions A-list bundle is higher during the period from 11 a.m. to 7 p.m. and lower during the period from 12 a.m. to 6 a.m. By contrast, the Sports bundle is very scattered, with very significant peaks throughout the day separated by large valleys. This window demonstrates fairly typical posting frequency. Deviations from the norm could be used to discover or pinpoint significant events. This information could be correlated with other information (e.g., popular words) to track the rise and fall of significant media events (e.g., the 2004 tsunami, the death of the pope). As with the previous graphs, these show some unusual behavior around the 24th, both sporadic behavior starting on the 21st and an increased density on the 25th.

    Bundle                   Words per Body        Bundle                   URLs per Body
    Computers & Technology           74.917        Computers & Technology           1.018
    Entertainment                   128.555        Entertainment                    1.687
    Eszter Politics                  79.919        Eszter Politics                  1.125
    Politics                         95.482        Politics                         0.658
    Sports                           67.006        Sports                           0.383
    Subscriptions A-list             57.980        Subscriptions A-list             1.051

Tables 2 & 3: Words per body (left) and URLs per body (right)

Tables 2 and 3 report average statistics for each bundle in terms of the length of items and the number of URLs mentioned in the body. From these, it can be observed that the entertainment blogs observed are likely to be lengthy and contain links, while sports blogs are likely to contain short posts without links. However, it is important to note that some blogging packages limit the length of RSS feeds, which may have affected these numbers.

    Bundle                   Words per Body    Factor
    Computers & Technology          157.872     2.107
    Entertainment                   314.966     2.450
    Eszter Politics                 287.090     3.592
    Politics                        316.392     3.314
    Sports                          581.732     8.682
    Subscriptions A-list            230.321     3.972

Table 4: Words per body containing URLs

Table 4 shows the average number of words per body when the item contains one or more URLs. The third column contains the factor between items containing URLs and


items that do not. In all cases, the average number of words is significantly higher when an item contains a URL than when it does not, almost nine times higher in the case of sports blogs. By comparing this information to the information in tables 2 and 3, we can conclude that a post that contains a URL is likely to contain multiple URLs. This shows that URLs are uncommon in almost all cases, appearing in somewhere between 10% and 30% of items, on average.

    Bundle                       Local    In-Bundle    Out-of-Bundle
    Computers & Technology   1,690.536    1,177.572        7,131.892
    Entertainment              752.025      678.750        8,569.225
    Eszter Politics            687.449    1,259.742        8,052.809
    Politics                 1,100.242      780.478        8,119.281
    Sports                     638.540      564.424        8,797.035
    Subscriptions A-list       523.641      691.602        8,784.757

Table 5: Number of URLs by type

Table 5 is an analysis of the webfeed_item_urls table, examining what people link to. Local links are either absolute links, pointing to the blog's host, in the case of known blog hosts, or the blog's domain, in the case of non-blog hosts, or relative links (i.e., links that do not contain http://). Blog hosts for this experiment were typepad.com, blogs.com, blogspot.com and blogdrive.com. In a small number of cases, these links are also non-http links (e.g., mailto). In-bundle links are http links pointing to other blogs contained in the bundle. Out-of-bundle links are all other http links. To facilitate comparison, all numbers have been normalized per 10,000 URLs.³ These numbers show a large degree of similarity across bundles, though there is a significant difference between Computers & Technology and the rest. From these numbers we can conclude that there is a difference in the behavior of technology blogs and others in terms of linking behavior. Also interesting is the significant difference in ratio between local links and in-bundle links in Eszter Politics and the rest of the bundles. This tells us that the Eszter Politics bundle is a tightly connected bundle, collecting blogs that relate to each other.

Table 6 shows a small sampling of the words used by posts in the Politics bundle over a four-day period, including their relative frequency per 10,000 words³ and the change from the previous day. When viewed over a large window, it is possible to do a post-mortem of certain events, tracking when they started to gain in popularity and when they faded into obscurity. The data could also be combined with Kleinberg's techniques for identifying bursts of words [9] to identify popular events as they are happening.

It is important to note that while the data was filtered against a list of common stop words, some constant terms (e.g., "said", "has") remain. For the purposes of searching, some

³ Normalization is performed by taking the number and dividing by the total number of occurrences, then multiplying by 10,000 to find the frequency per 10,000 items; e.g., number of times "said" appears / total number of words * 10,000.


common terms should be included that are not useful for analysis. When looking for bursts, or when identifying lists of popular words, a two-pass approach would be best, first building the complete list from the database, then recomputing the list filtering out common words. One approach for filtering would be to track the change per word over time and to exclude words that have a small average change over a large window (e.g., exclude words with an average change of 2 or less over the past two months).

    15-May 16-May 17-May 18-May

    Rank Word Freq +/- Word Freq +/- Word Freq +/- Word Freq +/-

    1 said 56.943 New said 68.936 0 has 59.085 1 has 59.378 0

    2 has 53.093 New has 63.208 0 said 59.000 -1 said 50.393 0

    3 about 46.736 New about 43.713 0 who 43.402 1 about 41.897 1

    4 who 45.393 New who 39.694 0 about 42.555 -1 who 41.213 -1

    5 were 36.440 New will 38.488 1 will 39.079 0 will 36.330 0

    6 will 33.754 New been 35.875 6 would 37.553 7 one 35.939 2

    7 would 33.038 New were 33.363 -2 out 31.280 5 would 33.791 -1

    8 all 32.769 New all 31.956 0 one 30.263 3 been 32.619 2

    9 one 32.142 New more 30.348 2 people 30.093 1 more 31.252 4

    10 people 31.068 New people 30.147 0 been 30.093 -4 all 30.470 2

    11 more 30.531 New one 29.645 -2 were 29.754 -4 out 30.275 -4

    12 been 28.740 New out 29.444 1 all 29.415 -4 were 28.126 -1

    13 out 28.382 New would 27.333 -6 more 28.737 -4 if 27.247 4

    14 no 27.487 New up 26.730 3 up 25.431 0 people 25.197 -5

    15 what 27.218 New Bush 25.123 11 Bush 25.346 0 up 25.099 -1

    16 so 25.069 New what 23.816 -1 like 23.905 12 can 23.634 14

17 up 24.263 New some 23.414 4 if 23.397 2 our 23.341 26

18 if 23.099 New Iraq 22.309 33 what 22.464 -2 what 22.071 0

    19 like 23.010 New if 22.108 -1 some 22.464 -2 some 21.583 0

    20 can 22.920 New can 21.806 0 into 22.464 7 other 21.583 7

    Table 6: Most popular words per day for the Politics bundle

Table 7 shows statistics for the word "Schiavo", including the rank among all words in the Politics bundle for that day, the relative frequency per 10,000 words and the change from the previous day. On the 15th, "Schiavo" was not ranked in the 1,000 most popular words. As the Schiavo case gained in national attention, the frequency exploded, peaking as the most popular word on the 22nd, as the courts and Congress debated her case. Yet her name remains one of the most popular words in the Politics bundle up until her death (and likely beyond). This pattern demonstrates both how a word will jump to the top of the list as a story breaks and also how significant events can be identified and their rise and fall tracked by looking back through the word lists.


    Date Rank Freq +/-

    15-Mar NA NA NA

    16-Mar 784 2.211 New

    17-Mar 524 2.967 260

    18-Mar 43 15.430 481

    19-Mar 28 20.438 15

    20-Mar 21 21.504 7

    21-Mar 23 23.101 -2

    22-Mar 1 83.628 22

    25-Mar 6 38.832 -5*

    26-Mar 5 38.011 1

    27-Mar 9 30.973 -4

    28-Mar 17 24.174 -8

    29-Mar 16 25.095 1

    30-Mar 24 20.818 -8

31-Mar 14 26.166 10

* difference between the 22nd and 25th

Table 7: Rank and frequency for the word "Schiavo" in the Politics bundle

    Conclusions

With RAIn, we have created a framework for monitoring and analyzing RSS feeds. It is fairly lightweight, requiring only inexpensive hardware. The design is modular, allowing for the easy replacement of components to either support different functionality or improve performance on a given system. RAIn is a complete system, including feed discovery, retrieval, archiving, indexing and a querying interface. It can be pointed at any site with an RSS feed and will enable archiving of the site's RSS and provides the ability to search for items based on keywords. More complex queries can be performed to generate statistical information, either about a site or a group of sites.

We also described some of the statistical analyses possible using RAIn. These range from simple metrics such as update frequency to complex analyses of the content of the items contained in feeds. For analysis, several bundles were defined. Each bundle contains a handpicked set of RSS feeds representing a particular blogging community. We were able to generate several different sets of statistical information about the blogs contained in the bundles as well as the other 77,373 blogs in RAIn's database.

Based on our experiences with RAIn, the system proved to be very capable. Inexpensive hardware supported processing more than 200,000 feeds per day. More expensive hardware, or a cluster of inexpensive hardware, should be capable of processing a significantly larger number of feeds. Despite claims that there may be more than 50 million blogs worldwide, it is likely that there are significantly fewer that are actively


    updated. A larger RAIn installation may be able to compete with sites like Technorati and

    Syndic8, monitoring a significant percentage of the active blogs in the world.

Building upon the statistics, more complex analyses are possible. RAIn could easily be used as the basis of a word monitoring system. Simple word-burst techniques could be applied to watch for sudden changes in a word, finding significant events as they are happening. In times of crisis, the Internet has proven to be the fastest source of news and information time and again. By actively monitoring RSS feeds, it may be possible to become aware of significant events before they hit the mainstream media.

RAIn also accumulates a substantial amount of content from blogs. On top of the statistical indexing methods currently in use, full-text indexing methods could be applied to create a blog search engine. Combined with existing search technologies like Lucene, Nutch or XTF, the content could be easily indexed and searched, providing a very substantial searchable archive of blogs.

RAIn proved to be very capable, monitoring a significant number of feeds on inexpensive hardware. More important, RAIn proved to be very flexible and adaptable. As configured, a large number of statistical analyses can be performed on RAIn's data. However, with RAIn's raw data, nearly any statistical analysis can be performed; one need only write the module to do it.


    References

[1] The Apache Software Foundation. Apache Lucene. Accessed May 10, 2005.

[2] Barr, Jeff, and Bill Kearney. Syndic8. Accessed May 10, 2005.

[3] Lemburg, Marc-André. mxDateTime: Date and Time Types for Python. Accessed May 10, 2005.

[4] Goodnough, Abby, and Maria Newman. "Supreme Court Rejects Request to Reinsert Feeding Tube." New York Times, March 24, 2005. Accessed May 10, 2005.

[5] Hastings, Kirk, and Martin Haye. XTF (eXtensible Text Framework). Accessed May 10, 2005.

[6] Jacobsen, Kjetil, and Markus Oberhumer. PycURL Home Page. April 6, 2005. Accessed May 10, 2005.

[7] Khare, Rohit, Doug Cutting, Kragen Sitaker, and Adam Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs Technical Report 04-04. Accessed May 10, 2005.

[8] Klam, Matthew. "Fear and Laptops on the Campaign Trail." New York Times, September 26, 2004. Accessed May 10, 2005.

[9] Kleinberg, Jon. "Bursty and Hierarchical Structure in Streams." Proceedings of the 8th ACM SIGKDD, July 2002.

[10] Libby, Dan. RSS 0.91 Spec, Revision 3. July 10, 1999. Accessed May 10, 2005.

[11] Mueller, Martin. The WordHoard Project. April 2005. Accessed May 10, 2005.

[12] Pilgrim, Mark. Universal Feed Parser. Accessed May 10, 2005.

[13] PostgreSQL Global Development Group. PostgreSQL: The World's Most Advanced Open Source Database. Accessed May 10, 2005.


[14] Riley, Duncan. "Number of Blogs Now Exceeds 50 Million Worldwide." The Blog Herald, April 14, 2005. Accessed May 10, 2005.

[15] Sifry, David, et al. Technorati. Accessed May 10, 2005.

[16] van Rossum, Guido, et al. Python Programming Language. Accessed May 10, 2005.

[17] Winer, Dave. RSS 2.0 Specification. January 30, 2005. Accessed May 10, 2005.

[18] Winstead, Jim, Jr. blo.gs. Accessed May 10, 2005.