8/6/2019 Tech Report NWU-CS-05-08: A System for Indexing and Archiving RSS Feeds
Computer Science Department
Technical Report
NWU-CS-05-08
June 6, 2005
RAIn: A System for Indexing and Archiving RSS Feeds
Jeff Cousens and Brian Dennis
Abstract
Really Simple Syndication, or RSS, provides a way for users to monitor a web site for
changes. One of the most popular uses of RSS is to syndicate a web log. RAIn, for RSS
Archiver and Indexer, is a system for monitoring and archiving RSS feeds, and for
indexing their contents. This report provides a discussion of the design and
implementation of RAIn. The report also includes a summary of RAIn's results over a
two-week period, illustrating both how a small, low-end system is capable of monitoring
a significant number of feeds and the types of statistics RAIn is capable of producing.
Keywords: Really Simple Syndication, RSS Feed Crawler, RSS Feed Statistics, RSS
Feed Indexing, Python
Table of Contents
Introduction
Overview
Designing a FeedCrawler
    Fetchers
        The FeedFetcher
        The CurlFetcher and SharedCurlFetcher
    Feed Newness
    Archiving
    Indexing
    Querying
        The FeedSearcher
        The FeedAnalyzer
    Feed Discovery
    Storage
        Schema
    Design Decisions and Lessons Learned
Experiences Using RAIn
    Experimental Setup
    Analysis
Conclusions
References
Introduction
The Internet is changing writing. Fifteen years ago, one had to convince a publisher to
accept a manuscript before one could become a real author. Publications were limited to
books, magazines and papers. Writing was something tangible, something that required
effort and overhead to produce and distribute. Then came the World Wide Web. Anyone
could create a home page. Companies like Tripod and Geocities enabled everyone to get
web space for free and publish anything they wanted. There was no longer any editorial
approval or need to sell copies.
Recently, interest in publishing on the Web has led to an explosion in the popularity of
web logs, or blogs. Blogs make it easy to maintain frequently updated sites. People now
are creating electronic diaries for the entire world to read. Where once everyone had a
home page, now everyone has a blog. Some authors post infrequently and personally,
while others take their blogs very seriously and professionally. In 2004, blogs played a
significant role in the US Presidential Election. At both the Democratic and the
Republican National Conventions, bloggers stood beside traditional journalists. Providing
real-time coverage via wireless devices, these bloggers were for many people the main
source of coverage of the conventions [8].
The blogging phenomenon is still relatively young. While some sites exist to gather blog
statistics, they often keep the information close, revealing only a very small subset of the
information gathered. There are millions of blogs out there [14], and little is known about
them. How do people use blogs? What are their posting habits? More interesting are the
stories told by these blogs. What hot news topic is everyone discussing? When did they
start?
RAIn was created to help answer these questions. RAIn is a software system capable of
collecting information about hundreds of thousands of blogs, allowing us to examine the
behavior of a large community of blogs. The system was designed to analyze the
behavior of communities of feeds, not blogging on the whole. While capable of handling
hundreds of thousands of blogs, it was not designed to compete with sites like Technorati
[15] or Syndic8 [2] that attempt to perform exhaustive monitoring of every blog in the
blogosphere, the world of web logs. The system is fairly lightweight, permitting it to be
run on commodity hardware; modular, enabling components to be changed or extended;
and flexible enough to be adapted to a wide variety of queries.
Overview
Really Simple Syndication, or RSS, is a lightweight XML-based method for sharing web
content. It provides a low-bandwidth way for users to watch a web site for changes. As
blogging has exploded in popularity, so has RSS. All major blogging packages include
support for syndication using RSS. RSS comes in two common versions: RSS 0.9x [10]
and RSS 2.0.x [17]. Depending upon the implementation, an RSS feed may contain
anything from a list of headlines with brief summaries to the full contents of a blog's
articles. A blog's RSS 2.0.1 feed might look like:
    <rss version="2.0">
      <channel>
        <title>Technology at Harvard Law</title>
        <link>http://blogs.law.harvard.edu/tech/</link>
        <description>Internet technology hosted by Berkman Center.</description>
        <pubDate>Tue, 04 Jan 2005 04:00:00 GMT</pubDate>
        <item>
          <title>RSS Usage Skyrockets in the U.S.</title>
          <link>http://blogs.law.harvard.edu/tech/2005/01/04#a821</link>
          <description>Six million Americans get news and information from RSS
            aggregators, according to a nationwide telephone survey conducted by
            the Pew Internet and American Life Project in November.</description>
          <author>Rogers Cadenhead</author>
        </item>
      </channel>
    </rss>
Minimally, an RSS feed is a sequence of loosely structured items. RSS's ease of use and
popularity have led to a syndication ecology, where readers monitor a site's RSS feed
instead of the site itself.
With RAIn, the goal was to create a system for monitoring, archiving and analyzing RSS
feeds. The system is designed to be modular, permitting new components to be added or
various components to be swapped out. This is true both in the design of the objects and
packages and in the way that RAIn leverages Python [16], the high-level, interpreted,
object-oriented language RAIn was written in. The system is lightweight, requiring only
inexpensive hardware to monitor hundreds of thousands of feeds on a daily basis.
Designing a FeedCrawler
RAIn is a modular system, consisting of an engine to manage feeds to be crawled and
modular components to fetch the feeds, archive the results and index the contents, as well
as components for answering queries and finding new feeds.
Figure 1: A diagram of RAIn's architecture
Figure 1 shows RAIn's architecture. The crawler determines which feeds need to be
crawled and creates a fetch thunk, an executable object containing a fetcher, an archiver,
an indexer and connections to the Internet and database. The fetch thunks are then put in
a crawl pool. As threads become available, fetch thunks in the crawl pool are executed.
The fetch thunk thread fetches the feed from the Internet and, if updated, archives and
indexes it.
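As a sketch, this fetch-thunk and crawl-pool pattern can be modeled with Python's standard queue and threading modules. The class and helper names below are ours, not RAIn's actual API, and the fetcher, archiver and indexer are stand-in placeholders:

```python
import queue
import threading

class FetchThunk:
    """Hypothetical executable object bundling everything one fetch needs."""
    def __init__(self, feed_url, fetcher, archiver, indexer):
        self.feed_url = feed_url
        self.fetcher, self.archiver, self.indexer = fetcher, archiver, indexer

    def __call__(self):
        body, changed = self.fetcher(self.feed_url)
        if changed:                      # archive and index only updated feeds
            self.archiver(self.feed_url, body)
            self.indexer(self.feed_url, body)

def worker(pool):
    while True:
        thunk = pool.get()
        if thunk is None:                # sentinel tells the thread to exit
            pool.task_done()
            break
        thunk()
        pool.task_done()

# Stand-in components so the sketch is self-contained.
archived = []
fetch = lambda url: ("<rss/>", True)
archive = lambda url, body: archived.append(url)
index = lambda url, body: None

crawl_pool = queue.Queue()
threads = [threading.Thread(target=worker, args=(crawl_pool,)) for _ in range(4)]
for t in threads:
    t.start()
for url in ("http://example.com/a.rss", "http://example.com/b.rss"):
    crawl_pool.put(FetchThunk(url, fetch, archive, index))
crawl_pool.join()                        # wait for every queued fetch to finish
for _ in threads:
    crawl_pool.put(None)
for t in threads:
    t.join()
```

The thunk carries everything a worker thread needs, so the threads themselves stay generic: they simply pull and execute whatever is in the pool.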
Every feed monitored by RAIn is stored in a database table. This table includes a variety
of information about the feed, including the URL to check, the time and result of the last
check, a status count, the time of the next check and information about the last fetch of
the feed, including the HTTP ETag and Last-Modified headers for determining feed
newness, if available, and an MD5 digest of the feed. On a user-defined interval, RAIn
checks to see if the database contains feeds that have yet to be checked or stale feeds.
Feeds are considered stale when the next check time has passed. URLs of feeds to be
checked are retrieved by the core module and dispatched to worker threads.
Fetchers
A Fetcher handles the HTTP retrieval and processing of a feed. It is responsible for
determining whether the feed has changed, updating the status of the feed and the feed
metadata, and archiving and indexing the feed, if necessary. Three different classes of
Fetchers were created:
The FeedFetcher
The FeedFetcher is the most basic Fetcher module. In addition to a simple set of Fetcher
routines, the FeedFetcher module also contains routines common to all Fetchers. It is
written using only stock Python routines and uses Python's urllib2 module to retrieve the
feeds. This allows the FeedFetcher module to be used on any system where Python is
available without any dependence on non-standard modules that might not be available
on all platforms and versions. It contains a reasonable amount of intelligence to attempt
to avoid overloading servers. However, urllib2 handles are not reusable, so every feed
crawled requires a new handle to be instantiated.
The CurlFetcher and SharedCurlFetcher
The CurlFetcher is an enhanced Fetcher module. It uses pycurl [6], a Python wrapper for
libcurl, a highly optimized C library for network operations. libcurl implements features
like caching the results of Domain Name System (DNS) queries and reusable handles,
allowing for improved performance over Python's urllib2. The CurlFetcher can
be used in two different ways: per fetch (CurlFetcher) or per thread (SharedCurlFetcher).
Per fetch, the CurlFetcher is similar to the FeedFetcher in that every feed crawled
requires a new handle to be instantiated. Per thread, the SharedCurlFetcher creates one
handle per thread when the FeedCrawler module is started. This saves the Fetcher the
overhead of having to instantiate a new pycurl handle every time a new feed is fetched.
This also allows the pycurl handle to cache DNS information across fetches, reducing
network overhead.
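The per-thread handle scheme can be sketched with Python's threading.local, which gives each thread its own slot. The class and method names here are ours; a real implementation would create a pycurl handle in the factory:

```python
import threading

class SharedHandleFetcher:
    """Keeps one reusable handle per thread, in the spirit of the
    SharedCurlFetcher. A generic factory stands in for pycurl handle
    creation so the sketch has no external dependencies."""

    def __init__(self, handle_factory):
        self._local = threading.local()    # independent storage per thread
        self._factory = handle_factory

    def handle(self):
        # Create the handle lazily, once per thread, then reuse it.
        if not hasattr(self._local, "handle"):
            self._local.handle = self._factory()
        return self._local.handle
```

Because each thread reuses its own handle, the per-fetch instantiation cost disappears and any state the handle caches (such as DNS results) survives across fetches.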
The performance differences between the FeedFetcher, CurlFetcher and
SharedCurlFetcher were only briefly examined. As might be expected, the
SharedCurlFetcher had the best performance in terms of number of feeds fetched per
minute. There was not an obvious winner between the FeedFetcher and the CurlFetcher.
Feed Newness
Once a feed is retrieved, all Fetchers check to see if the feed has changed using several
different metrics. If the HTTP headers contain an ETag field and the ETag field matches
the previous ETag, or if the header contains a Last-Modified field and the Last-Modified
field matches the previous Last-Modified, the feed is considered unchanged. If the feed is
not found to be unchanged by ETag or Last-Modified header, an MD5 digest is taken of
the entire feed. This is compared against a previous MD5 digest. If the digests are
different, the feed is considered updated. Updated feeds are archived, both in their raw
format for potential future analysis and as individual items.
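The newness check described above can be sketched as follows. The function and the names of the previously recorded fields are our assumptions; the report does not show RAIn's actual code:

```python
import hashlib

def feed_changed(headers, body, prev):
    """Return True if the feed should be treated as updated.

    `headers` holds the HTTP response headers of the current fetch;
    `prev` is a dict of values recorded at the last fetch (field names
    here are hypothetical)."""
    etag = headers.get("ETag")
    if etag is not None and etag == prev.get("etag"):
        return False                     # server reports the same entity
    modified = headers.get("Last-Modified")
    if modified is not None and modified == prev.get("last_modified"):
        return False                     # same modification timestamp
    # Fall back to comparing an MD5 digest of the entire feed body.
    digest = hashlib.md5(body).hexdigest()
    return digest != prev.get("digest")
```

The digest fallback matters because many servers in practice omit ETag and Last-Modified headers, or change them without changing the feed body.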
Frequency of fetching is adaptive in an attempt to match the feeds change frequency.
The next fetch time is determined by adding a fetch interval to the time the current fetch
was performed. All feeds begin with an interval of 1 hour, which is then modified based
upon the result of the current fetch. If the feed has changed since the last check, the fetch
interval is reduced by 2 hours, to a minimum of 1 hour. If the feed was unchanged, or
there was an error fetching the feed, the fetch interval is increased by 4 hours, to a
maximum of 24 hours. The number of errors that occur when fetching a feed is
recorded; after too many consecutive failures, feeds are marked as removed and no longer
checked.
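The interval update rule above is simple enough to state as a few lines of Python. The constants come from the report; the function name is ours:

```python
MIN_INTERVAL = 1      # hours; floor for frequently updated feeds
MAX_INTERVAL = 24     # hours; ceiling for quiet or broken feeds
START_INTERVAL = 1    # every feed begins here

def next_interval(current, changed, error=False):
    """Adaptive fetch-interval rule: shrink by 2 hours on a change,
    grow by 4 hours when unchanged or on a fetch error."""
    if changed and not error:
        return max(MIN_INTERVAL, current - 2)   # poll busy feeds more often
    return min(MAX_INTERVAL, current + 4)       # back off on quiet feeds
```

Starting every feed at 1 hour means an active feed is never penalized at first, while a dormant feed backs off to the daily ceiling within a handful of fetches.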
Adapting to a feed's frequency of change helps RAIn discover new items as they are
posted without overloading a server by repeatedly checking for new items when a feed is
unchanged. A rudimentary analysis showed that adaptive fetching worked: the fetch
interval grew larger for infrequently updated feeds while it stayed small for frequently
updated feeds. However, a more in-depth analysis over a longer time period would be
necessary to determine how effective RAIn's current implementation is.
Archiving
When a feed is determined to be new, the raw feed is archived in a database table. The
feed is compressed using Python's zlib module. This is important as feeds, being text,
compress to somewhere between 5 and 10% of their original size. The compressed feed is
then inserted into a binary database field, along with the date that the feed was archived.
This raw feed can then be retrieved and uncompressed for later analysis.
The feed is also parsed using Mark Pilgrim's Universal Feed Parser [12] into the
individual entries in the feed. The items are then individually checked against the
database using an MD5 digest to filter out items that have already been processed. New
items are serialized using Python's pickle module, compressed using Python's zlib
module and stored in a
binary database field. These items can be retrieved, uncompressed and unserialized to
access the raw contents of the item, including the full text of the item.
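The compress-and-deduplicate step can be sketched with the standard library modules the report names. The function names, and the choice to digest the pickled bytes, are our assumptions; RAIn may digest a different representation of the item:

```python
import hashlib
import pickle
import zlib

def pack_feed(raw_feed):
    """Compress the raw feed for the binary archive column."""
    return zlib.compress(raw_feed)

def pack_new_item(item, seen_digests):
    """Serialize, digest, and compress one parsed item.

    Returns (digest, blob) for a new item, or None when the digest has
    already been recorded (the set stands in for the database check)."""
    serialized = pickle.dumps(item)
    digest = hashlib.md5(serialized).hexdigest()
    if digest in seen_digests:        # item already processed; skip it
        return None
    seen_digests.add(digest)
    return digest, zlib.compress(serialized)
```

Retrieval reverses the pipeline: decompress the blob, then unpickle it to get the original item, full text included.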
Information about the items contained within the feed is stored, including the URL of the
item, the time that the item was posted, if available, the time that the item was archived
and an MD5 digest of the item for later comparison. This information can be used to look
for items from a certain site or in a certain date range. It can also be used to retrieve the
item referenced in the feed from the feed's web site.
Indexing
In order to facilitate querying and analyzing the content of the feeds, the words in the
feeds are indexed. This is a daunting task. Assuming the crawler is able to retrieve
200,000 feeds per day, that only 50% of these feeds contain new items and that the
average item contains 20 words, the crawler will store 2,000,000 words per day. At that
pace, in less than two months the crawler will have indexed more words than are
contained in the British National Corpus. In practice, the numbers are significantly
higher.
The design for indexing is based upon the method used by the WordHoard project at
Northwestern University [11]. One aspect of WordHoard is an interface that allows
literary scholars to search a corpus of works for words and to generate statistics,
including frequency and counts. This is interesting to literary scholars as it can reveal
patterns in the author's works as well as trends in literature on the whole. In WordHoard,
works of literature are parsed into individual words. The words are stored in two different
tables: a table of individual lemmas, for linguistic analysis across a corpus, and a table of
word occurrences, for analyzing specific instances of a word's usage. The word
occurrence table stores complete information about every word and punctuation mark in
every work in a corpus and, with some meta-information such as speakers and act/scene
or page, may be used to reconstruct the entire work, word for word.
Using WordHoard as a model, RAIn splits an RSS feed into individual items, then parses
each item into individual words. Here RAIn departs slightly from WordHoard's word
occurrence model. With RAIn retrieving more than 2,000,000 new words per day, storing
complete word occurrence information for any significant time interval would consume
an unreasonable amount of storage. As a compromise, sentences are stripped of
punctuation and filtered against a list of common stop words. The remaining words are
then aggregated within an item and the distinct filtered words and their counts are stored
in a database.
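The strip-filter-aggregate step amounts to a word counter. The sketch below assumes a tiny illustrative stop-word set; RAIn's actual list is not given in the report:

```python
import collections
import string

# Illustrative subset; the real stop-word list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def index_item(text):
    """Strip punctuation, drop stop words, and count the remaining
    distinct words in one item's text."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return collections.Counter(w for w in words if w not in STOP_WORDS)
```

Only the distinct words and their counts survive, which is exactly the compromise described above: enough for frequency statistics, far cheaper than full word-occurrence records.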
While words can tell a story with what they say, URLs define relationships. They show
how posts relate to other posts and how sites relate to other sites. Special attention is paid
to URLs in order to track these relationships. When indexing, RAIn looks for common
URL patterns and stores them in a separate table. For this purpose, a URL is considered
to be either a string containing the pattern http:// or the contents of an HTML anchor
element's href attribute.
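The two URL forms just described can be matched with a pair of regular expressions. These patterns are our sketch, not RAIn's actual ones, and deliberately ignore edge cases like unquoted attributes:

```python
import re

# The href attribute of an HTML anchor element (quoted values only).
HREF_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
# A bare string containing the http:// pattern.
BARE_RE = re.compile(r'http://[^\s"\'<>]+')

def extract_urls(html):
    """Collect both anchor hrefs and bare http:// strings from item text."""
    urls = set(HREF_RE.findall(html))
    urls.update(BARE_RE.findall(html))
    return urls
```

Storing the results in their own table keeps link-relationship queries separate from (and much cheaper than) word-index queries.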
Using the information in these tables, many different types of queries are possible. One
family of queries provides general statistical information: e.g., how long is the
average item, or what is the average ratio of URLs to words? Blogging is still young and
not much is known about the posting habits of bloggers. Are more people verbose but
infrequent posters or brief but frequent posters? Does time of day impact posting? How
about day of week? These statistics begin to paint a picture.
With the right constraints, these statistics can even tell a story. For example, by
monitoring a collection of political blogs before, during and after a keystone event (e.g., a
party's national convention or the State of the Union), one might be able to tell whether
the event was motivating, discouraging, or even mostly ignored.
More interesting are informational queries: e.g., how did the frequency of a word change
over a given period of time, or what are the most commonly used words? A political
scientist might wonder how the usage of Schiavo¹ changed during the period from
February through April of 2005. When did people start posting about her? When did they
stop? A linguist might wonder what the most commonly used words are, and how this
changes over time. Where 18th century activists wrote books, many 21st century activists
write blogs. Instead of being stored on paper, the snapshot of society today's authors
provide us with is online.
Many blogs now make RSS feeds available for comments as well. By monitoring
comments, one can gauge reader response to certain topics. Which posts engendered the
most comments? Which were largely ignored?
Querying
In order to facilitate analysis, a few different Python modules were created. These
modules use two different approaches. The first is to provide a very generalized module,
which can accommodate a wide variety of queries based upon RAIn's indexing. The
second is to provide a very specific module, only capable of providing a focused set of
information but able to generate it in an optimized way.
The FeedSearcher
The FeedSearcher module provides a generalized interface to the RAIn database. It
allows someone to retrieve a list of items, words or URLs based on a series of constraints,
including date, word count, a set of feeds to search and a pattern (either an exact pattern,
a substring or a regular expression). It even allows someone to limit the number of
results, to specify an offset and to get more information about a result, tying a word or
URL to an item, and an item to a feed. The following is an example of using the
FeedSearcher to
find the number of updates per day for a single day:
    import FeedSearcher

    fs = FeedSearcher.FeedSearcher('localhost', 'db', 'user', 'pass')
    fs.type = 'entries'
    fs.start_date = '2005-03-15'
    fs.end_date = '2005-03-15'
    results = {}
    for item in fs.execute():
        for details in fs.getDetails('entries', item['feed_id']):
            if details['feed_url'] in results:
                results[details['feed_url']] += 1
            else:
                results[details['feed_url']] = 1
¹ Terri Schiavo was a Florida woman in a persistent vegetative state whose right-to-life
vs. right-to-die case became national news in March 2005 [1].
This generates a Python hash table, or dictionary, using the feed URLs as keys with the
number of times the feed was updated that day as values. Yet in order to achieve the
flexibility of the FeedSearcher, the interface is generic. The FeedSearcher's execute
method does not return enough information, so the getDetails method must be invoked on
every item returned by execute. Each call to getDetails involves a SQL query. Thus, in
order to compute this statistic, the total number of SQL queries involved is the number of
items plus one. For any large data set or complex query, a more optimized interface is
desired.
The FeedAnalyzer
The FeedAnalyzer provides a very specific interface to the RAIn database. It is only
capable of performing a fixed set of queries, but it performs them much better than the
FeedSearcher could. The FeedAnalyzer was written for this report, and generated the
statistics presented in the empirical study. It was designed from the top down, first
looking at the information to present, then determining what queries were necessary to
generate that information. The queries were optimized for the task, which provided faster
query execution times and reduced the overall number of queries, while
the results were returned in the format required for importing into Excel for analysis. The
following is an example of using the FeedAnalyzer to find the number of updates per day
for a range of days:
    import FeedAnalyzer
    import mx.DateTime

    fa = FeedAnalyzer.FeedAnalyzer(
        'localhost', 'db', 'user', 'pass',
        mx.DateTime.ISO.ParseDateTime('2005-03-15'),
        mx.DateTime.ISO.ParseDateTime('2005-03-31'))
    result = fa.updatesPerDay()
As with the FeedSearcher, this also generates a Python dictionary using the feed URLs as
keys with the number of times the feed was updated that day as values. However, the
FeedAnalyzer generates this information using both fewer lines of Python and only one
SQL query.
Feed Discovery
For some data sets, it is necessary to analyze a specific collection of feeds. However,
sometimes all that is desired is a large collection of feeds. The FeedFinder module is
designed to find new feeds to crawl. It is capable of visiting a blog tracking web site,
such as blo.gs [18], and retrieving a list of RSS feed URLs. This list is then parsed by the
FeedFinder and checked against the database (and itself) for duplicate feed URLs. New
feeds are then added to the database for the FeedCrawler to crawl. This module was
integrated into the FeedCrawler, although it may also be run independently.
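The duplicate-filtering step can be sketched as follows. The function name is ours, and a set of known URLs stands in for the database lookup:

```python
def new_feed_urls(candidates, known):
    """Filter a fetched list of feed URLs against the database (here a
    set of already-known URLs) and against itself, preserving order."""
    seen = set(known)
    fresh = []
    for url in candidates:
        if url not in seen:       # skip both known and repeated URLs
            seen.add(url)
            fresh.append(url)
    return fresh
```

Checking the list against itself matters because blog-tracking sites often list the same feed more than once within a single page of results.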
Storage
RAIn currently stores all of its data in a relational database. It leverages Python's DB-
API, allowing the database to be fairly easily swapped out. Modifications are necessary
only when the schema changes due to differences in data types (e.g., PostgreSQL's bytea
vs. MySQL's longblob) or to accommodate differences in modules' support of data types
(e.g., pyPgSQL's PgSQL.PgBytea versus psycopg's Binary).
The database for RAIn is PostgreSQL [13]. Initially, pyPgSQL was used as the Python
DB-API interface to PostgreSQL, and PostgreSQL performed well. However, as the
database grew in size, performance fell off. At one point, RAIn was only processing
thousands of feeds per day. To improve performance, the database interface module was
switched from pyPgSQL to psycopg. While both modules provide a DB-API 2.0-
compliant interface, psycopg was designed to be much faster than other modules. One
important difference between pyPgSQL and psycopg is a bug with psycopg 1 and
Unicode characters. This required a workaround to handle potential Unicode data. In
addition, psycopg required more attention to be paid to transactions so updates would be
available across all database handles.
Schema
RAIn's database consists of six tables:
- webfeeds: This table contains information about the RSS feeds being monitored
  by RAIn. It is updated every time a feed is crawled with information about the
  fetch operation. If the feed has changed, http_etag, http_last_modified,
  fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count,
  fetch_interval and fetch_digest are updated. If the feed has not changed since the
  last fetch, only fetch_last_attempt, fetch_next_attempt, fetch_status,
  fetch_status_count and fetch_interval are updated, as the ETag, Last-Modified and
  MD5 digest will be unchanged.

- webfeeds_archive: This table contains a zlib-compressed archive of every new
  RSS feed RAIn fetches, stored as a binary object (e.g., PostgreSQL's bytea,
  MySQL's longblob) in data_bytes. The date that the feed was archived is stored in
  data_archived. This allows the data to be analyzed using a different methodology
  at a later date.

- webfeed_items: This table contains information about the individual items in the
  feeds fetched by RAIn.

- webfeed_item_words: This table contains all of the words found by the
  FeedIndexer. The count for each word per item is stored in word_count. To
  facilitate querying, a ts_vector for the word is stored in index_word.
- webfeed_item_urls: This table contains all of the URLs found by the
  FeedIndexer. To facilitate querying, a ts_vector for the URL is stored in
  index_url.

- webfeed_bundles: This table contains information about bundles of RSS feeds,
  representing a many-to-one relationship between a group of feeds and a bundle
  name. This relationship allows groups of feeds to be tied together into a single
  bundle for analysis.
Figure 2: An entity-relationship diagram for RAIn's database
In addition to feed information stored in the database, RAIn also stores low-level
information in log files. The logging level may be configured from critical, logging only
events that would prevent RAIn from running, to debug, logging almost every operation
RAIn performs.
Design Decisions and Lessons Learned
Storage: By itself, an RSS feed does not represent a significant amount of data; typically
only a couple of KB. However, when processing 200,000 feeds per day, 75% of which
are updated at least once a day, that couple of KB adds up very quickly. Space rapidly
became a concern during development and we had to make several changes as a result.
Most important was the implementation of accurate duplicate elimination. The initial
design did not fully incorporate MD5 checksums. This was improved so that both the
feed itself and the items within the feed are now checksummed. The feed is checked
against the last feed to see if it has changed, while each item is checked against all items
from that feed to ensure that it has not already been processed. Adding both of these
checksums made a significant difference: feed checksums doubled the number of feeds
marked as unchanged, while item checksums reduced the number of items processed by
more than 75%. Not only did these changes directly correspond to savings in storage, but
they also allowed significant increases in the number of feeds crawled per day. However,
even with these reductions, the amount of data stored is still very significant.
Indexing: While indexing is a very simple process to implement, it is very difficult to
make it work well. The average post contains 350 distinct words. The architecture of the
word index, while very useful for statistical analyses, requires that each of these words be
its own record. This means that every feed crawled requires an average of 355 database
inserts. This is a very significant amount of database I/O and comes at a non-negligible
cost.
Stop wording is one approach to reducing the amount of data. While initially a basic set
of stop words was used, during analysis it was discovered that additional stop words are
necessary. Depending upon the data set being analyzed, it may be necessary to analyze a
few weeks' worth of data to get an adequate feel for which words are important and which
are not. It is also important to consider the goal in using stop words. Some common stop
words may actually be useful for answering certain questions; e.g., to analyze whether
posts about men or women are more common, it would be desirable to have he/she and
his/hers in the database.
The size of the database also impacts performance. At 350 words per item and 100,000
new items per day, the words table would accumulate more than 1 billion words in less
than a month. Even with stop words and duplicate elimination, the size of the words table
is very substantial. This poses its own set of concerns when analyzing or updating the
database. It places constraints on database performance, file system usage and evendatabase design and index usage. As the database grows, queries take longer to process.
Past a certain point, queries may no longer be performed interactively. Certain types of
queries eventually become impossible.
One solution to this problem is to change the database design. Currently, all of the words
are stored in a single table. Switching to a design where a new table is created every day
would limit the size of the individual tables, keeping search times interactive. Some data
sets (e.g., most popular words) could be precomputed and the results stored in a separate,
much smaller table.
Another approach would be to store only frequency information in the database and not
to maintain a full-text index. The items could be stored as XML and indexed using a
different mechanism, such as Lucene [1], Nutch [7] or XTF [5].
Database Design: Relational databases can provide a great deal of power and allow you
to perform some very complicated queries very easily. However, as the size of the
database increases, greater attention must be paid to the design of the database and how it
impacts performance.
The initial design of the stale feeds query used a not-equals constraint to filter removed
feeds:
    SELECT feed_url, http_etag, http_last_modified,
           fetch_next_attempt, fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt AND fetch_status != 16
    ORDER BY fetch_next_attempt LIMIT 200;
PostgreSQL executed this constraint as a sequence scan, linear in the number of records, despite the presence of an index on the feed status. This was not a problem with a small database, but, as the number of feeds monitored increased, so did the time it took for the stale feeds query to execute. This had an impact on the crawler's performance, as a significant amount of processor time and database I/O was lost waiting for this query to finish. To improve performance, the feed removal process was redesigned. Feeds continue to be marked as removed, but the next fetch time is set 1,000 years in the future. This allows use of a simple date filter, as it is unlikely that RAIn will be used with a current date of 3005. Since the next fetch field had previously been left untouched once a feed was removed, serving as a timestamp of when the feed was removed, a new field was added to record the removal time.
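With the far-future fetch time in place, the stale feeds query reduces to a single indexed date comparison. A sketch of the revised query (a reconstruction; the report does not show the final SQL):

```python
# Reconstruction of the simplified stale-feeds query: removed feeds have
# fetch_next_attempt pushed 1,000 years out, so no status filter is needed.
STALE_FEEDS_SQL = """
SELECT feed_url, http_etag, http_last_modified,
       fetch_next_attempt, fetch_interval, fetch_digest
FROM webfeeds
WHERE now() >= fetch_next_attempt
ORDER BY fetch_next_attempt
LIMIT 200;
"""

# A removed feed simply gets a far-future next fetch time:
from datetime import datetime
removed_marker = datetime(3005, 1, 1)

print("fetch_status" in STALE_FEEDS_SQL)  # False: no status filter needed
```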
Relational databases allow the creation of constraints on data, enforcing a schema and
ensuring data integrity. This protection comes at a cost. The initial design included
referential integrity constraints connecting the various tables, ensuring that a webfeed_item and webfeed_archive had a corresponding webfeed, and that a webfeed_item_url and webfeed_item_word had a corresponding webfeed_item. At one
point in development, it was necessary to perform some deletions from the database to
remove duplicate feeds. With referential integrity constraints present, the database needed
to perform complex joins and scans in order to process the deletion. Even with indexes in place, the queries took a significant amount of time. At that point, the decision to include referential integrity constraints was reevaluated and the constraints were removed, instead trusting RAIn to insert data accurately.
Data Format and Encoding: One problem when crawling a disparate set of web data is that it comes in a wide variety of formats. This makes it difficult to predict what data format a feed will use. Two places this caused problems were in the character encoding and the timestamp format. Most RSS feeds use ASCII, ISO-8859-1 or UTF-8 encoding, but some use other encodings. Similarly, most HTTP servers use a standard timestamp format that can be handled by the eGenix mxDateTime library [3], but some use uncommon formats not handled by mxDateTime; e.g., 2005-04-28T12:13:49Z. With feed finding enabled, it is likely that the database will eventually include some feeds using unhandled formats. As a result, careful attention must be paid to exception handling. On the data set RAIn was tested with, this happens about 0.005% of the time.
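A defensive parsing pattern of this kind, sketched with Python's standard library rather than mxDateTime (the format list is illustrative, not exhaustive):

```python
from datetime import datetime

# Formats seen in the wild; this list is illustrative, not exhaustive.
FALLBACK_FORMATS = [
    "%a, %d %b %Y %H:%M:%S GMT",   # standard HTTP date
    "%Y-%m-%dT%H:%M:%SZ",          # ISO 8601 variant not handled by mxDateTime
]

def parse_timestamp(value):
    """Try each known format; return None rather than crash the crawler."""
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None  # unhandled format: log it and move on

print(parse_timestamp("2005-04-28T12:13:49Z"))  # 2005-04-28 12:13:49
```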
Experiences Using RAIn
A 16-day period at the end of March was selected to empirically investigate RAIn's capabilities. Several hand-selected groups of RSS feeds were added to the database and monitored, along with approximately 77,000 existing feeds. When the period ended, the data from those feeds was exported and analyzed for several different statistics.
Experimental Setup
For this experiment, RAIn was run on a dual Pentium III at 933 MHz with 1 GB RAM running Debian GNU/Linux testing (sarge) with the Debian versions of PostgreSQL (7.4.7-2), Python (2.3.5-1), psycopg (1.1.18-1), pycurl (7.13.0-1) and libcurl3 (7.13.1-2). The OS and crawler were stored on an internal Ultra-160 SCSI disk. The database was stored on an external Ultra Wide SCSI attached RAID5 ATA-100 disk array. The server was connected to the Internet via a 100 Mb Ethernet connection and configured to use a local installation of the DeleGate proxy server (8.9.6) with caching enabled.
In order for analysis to produce meaningful results, the feeds analyzed should be representative of something. To that end, six bundles of feeds were defined for analysis, totaling 723 feeds. These bundles are Computers & Technology (150 feeds), Entertainment (132 feeds), Eszter Politics (107 feeds), Politics (232 feeds), Sports (52 feeds) and Subscriptions A-list (50 feeds). Four of these (Computers & Technology, Entertainment, Politics and Sports) are the feeds from Yahoo's Directory's Weblogs categories. Eszter Politics is a handpicked collection of political feeds from a colleague in Northwestern's School of Communications who is studying political blogging. Subscriptions A-list is the intersection of several different top-feeds lists. This allows us
to examine the behavior of specific blogging communities (e.g., the political community
or the entertainment community) as well as to compare and contrast different communities.
The database was not purged before data gathering. In addition to the 723 bundled feeds,
there were 77,373 feeds that were not part of any bundle. These feeds were obtained from several different sources, including the list of recently updated feeds on blo.gs and the list of syndicated feeds on Syndic8, two web sites providing centralized monitoring of over 10 million blogs. Some of the statistics analyzed are representative of the database as a whole, including both bundled and unbundled feeds. The unbundled feeds are not categorized in any way, other than having been on lists of blogs that were validated and updated recently. As such, they are representative of blogging on the whole and not any
particular field. This is certainly a useful collection of feeds to consider. However, sites
such as Technorati monitor 10 million blogs and the number of blogs in existence has
been estimated to be over 50 million [14], so this is a very small slice of all blogs.
Analysis was performed while RAIn was still running, using the live database. For some queries, the live tables could be used. However, for others, especially those involving words, the live tables are too large to analyze in a reasonable amount of time. For the purpose of analysis, all of the data for the window being analyzed was exported into separate tables. This reduced the size of the words tables by more than 99%, making analyses possible in a reasonable amount of time. Even with these separate, smaller tables, some of the queries took more than 10 minutes to perform, while the queries to create these tables took several hours to complete.
For the duration of our data gathering, RAIn was checking for stale feeds every 60 seconds and looking for a maximum of 400 stale feeds. Thirty threads were available in the thread pool. The hardware was capable of supporting a higher number of threads, and thus processing a larger number of stale feeds per minute, but the number of threads was intentionally kept low to facilitate simultaneous crawling and querying. RAIn was constantly busy on a feed set of 78,096 feeds, making approximately 235,000 feed visits per day.
Analysis
We selected March 15th through March 31st, 2005, as our window to analyze.
Unfortunately, there was an unexplained glitch and the kernel killed the crawler early on March 23rd. The crawler was restarted on the 25th, but some information was missed as a result of this failure, and certain aspects of the results are atypical. This has been indicated where it affects the results.
Figure 3: Feed status per day
Figure 4: Disk usage per day
Figure 3 shows the performance of the crawler in terms of how many feeds the crawler
was able to visit each day, along with a breakdown of how the feeds were classified.
Figure 4 shows the size, in KB, by which the database grew each day storing the information about these feeds. It is important to note that these numbers pertain only to feeds, but database usage increases in proportion both to the number of updated feeds and to the number of new items in those feeds. The number of new items per feed was unusually high on March 25th due to the performance problems on the previous days, so the disk-per-feed ratio on that day is not representative of typical usage.
These numbers were obtained from RAIn's logs and are representative of the performance of the system as a whole, including both bundled and unbundled feeds. From them, one can obtain a feel for RAIn's performance. From the total number of feeds, one can estimate that every feed in the database was visited approximately 3 times per day. The high number of updated feeds, coupled with RAIn's constant activity, hints that RAIn was not able to catch updates as they happened, instead catching them hours after they occurred. The high number of unchanged feeds hints that the cap of 24 hours for the fetch interval may need to be increased.
These numbers also highlight the glitch that resulted in the crawler being killed by the kernel, as seen by a marked difference in performance beginning on March 21st.
Figure 5: The number of updates per day per bundle
Figure 5 shows the average number of updates per day per bundle, normalized against the number of feeds in the bundle.2 One interesting result is that the Subscriptions A-list bundle has a much higher number of posts per day than the other bundles. Membership in the Subscriptions A-list bundle is based roughly on popularity, not number of updates. From this, it is possible to conclude that popular blogs are updated more frequently than other blogs. Another surprising result is that the number of posts per day for the Entertainment bundle is so low. In this case, it can be concluded either that entertainment blogs do not post all that frequently or that the Entertainment bundle contained blogs that were not being updated during the analysis window. All of the numbers are slightly low due to the complications around the 24th, but they still demonstrate the relative frequencies between bundles.
Figure 6: The number of updates per day of week per bundle
Figure 6 shows the average number of updates per bundle against the day of the week, normalized against the number of feeds in the bundle.2 Again, the numbers are slightly low due to the complications around the 24th. Unlike the previous graph, the relative frequencies are affected by those complications: the numbers for Saturday and Sunday are unaffected, the numbers for Monday and Friday only slightly affected, and the numbers for Tuesday, Wednesday and Thursday significantly affected. Looking at the frequencies, it appears that there may be an interesting, if perhaps predictable, story about updates and day of week that should be reexamined.
2 Normalization is performed by dividing the number of updates by the number of feeds in the bundle.
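The footnote's normalization is a single division; for example, with hypothetical numbers:

```python
def normalize_updates(update_count, feed_count):
    """Average updates per feed, as described in the footnote above."""
    return update_count / float(feed_count)

# e.g., 104 updates across the 52 Sports feeds -> 2.0 updates per feed
# (the update count here is hypothetical)
print(normalize_updates(104, 52))  # 2.0
```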
Figure 7: Date vs. time vs. frequency showing only the Sports bundle
Figure 8: Date vs. time vs. frequency showing only the Subscriptions A-list bundle
Figures 7 and 8 show the post density against date and time. The size of a bubble is representative of the number of posts in a ten-minute window. These graphs give an idea of the posting habits of a bundle. On the whole, posting is steady, but there are definite peaks and valleys for some bundles. For example, post density for the Subscriptions A-list bundle is higher during the period from 11 a.m. to 7 p.m. and lower during the period from 12 a.m. to 6 a.m. By contrast, the Sports bundle is very scattered, with very significant peaks throughout the day separated by large valleys. This window demonstrates fairly typical posting frequency. Deviations from the norm could be used to discover or pinpoint significant events. This information could be correlated with other information (e.g., popular words) to track the rise and fall of significant media events (e.g., the 2004 tsunami, the death of the pope). As with the previous graphs, these show some unusual behavior around the 24th, both sporadic behavior starting on the 21st and an increased density on the 25th.
Bundle                   Words per Body        Bundle                   URLs per Body
Computers & Technology    74.917               Computers & Technology   1.018
Entertainment            128.555               Entertainment            1.687
Eszter Politics           79.919               Eszter Politics          1.125
Politics                  95.482               Politics                 0.658
Sports                    67.006               Sports                   0.383
Subscriptions A-list      57.980               Subscriptions A-list     1.051

Tables 2 & 3: Words per body (left) and URLs per body (right)
Tables 2 and 3 report average statistics for each bundle in terms of the length of items and the number of URLs mentioned in the body. From these, it can be observed that entertainment blogs are likely to be lengthy and contain links, while sports blogs are likely to contain short posts without links. However, it is important to note that some blogging packages limit the length of RSS feeds, which may have affected these numbers.
Bundle                   Words per Body   Factor
Computers & Technology   157.872          2.107
Entertainment            314.966          2.450
Eszter Politics          287.090          3.592
Politics                 316.392          3.314
Sports                   581.732          8.682
Subscriptions A-list     230.321          3.972

Table 4: Words per body containing URLs
Table 4 shows the average number of words per body when the item contains one or
more URLs. The 3rd column contains the factor between items containing URLs and
items that do not. In all cases, the average number of words is significantly higher when an item contains a URL than when it does not, almost nine times higher in the case of sports blogs. By comparing this information to the information in Tables 2 and 3, we can conclude that a post that contains a URL is likely to contain multiple URLs. This shows that URLs are uncommon in almost all cases, appearing in somewhere between 10% and 30% of items, on average.
Bundle                   Local       In-Bundle   Out-of-Bundle
Computers & Technology   1,690.536   1,177.572   7,131.892
Entertainment              752.025     678.750   8,569.225
Eszter Politics            687.449   1,259.742   8,052.809
Politics                 1,100.242     780.478   8,119.281
Sports                     638.540     564.424   8,797.035
Subscriptions A-list       523.641     691.602   8,784.757

Table 5: Number of URLs by type
Table 5 is an analysis of the webfeed_item_urls table, examining what people link to. Local links are either absolute links, pointing to the blog's host (in the case of known blog hosts) or the blog's domain (in the case of non-blog hosts), or relative links (i.e., links that do not contain http://). Blog hosts for this experiment were typepad.com, blogs.com, blogspot.com and blogdrive.com. In a small number of cases, these links are also non-http links (e.g., mailto). In-bundle links are http links pointing to other blog hosts contained in the bundle. Out-of-bundle links are all other http links. To facilitate comparison, all numbers have been normalized per 10,000 URLs.3 These numbers show a large degree of similarity across bundles, though there is a significant difference between Computers & Technology and the rest. From these numbers we can conclude that there is a difference between the linking behavior of technology blogs and that of others. Also interesting is the significant difference in the ratio between local links and in-bundle links in Eszter Politics compared to the rest of the bundles. This tells us that the Eszter Politics bundle is a tightly connected bundle, collecting blogs that relate to each other.
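The classification just described can be sketched as follows; this helper is a simplified reconstruction, not RAIn's actual code, and ignores subtleties such as ports, https links and the known-blog-host vs. bare-domain distinction:

```python
BLOG_HOSTS = {"typepad.com", "blogs.com", "blogspot.com", "blogdrive.com"}

def classify_link(url, feed_host, bundle_hosts):
    """Classify a link as 'local', 'in-bundle' or 'out-of-bundle'.

    Simplified reconstruction for illustration only.
    """
    if not url.startswith("http://"):
        return "local"            # relative (or non-http, e.g. mailto) link
    host = url[len("http://"):].split("/")[0]
    if host == feed_host:
        return "local"            # absolute link back to the blog's own host
    if host in bundle_hosts:
        return "in-bundle"        # link to another blog in the same bundle
    return "out-of-bundle"        # any other http link

print(classify_link("/archives/2005/03.html", "example.blogspot.com", set()))  # local
```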
Table 6 shows a small sampling of the words used by posts in the Politics bundle over a four-day period, including their relative frequency per 10,000 words3 and the change from the previous day. When viewed over a large window, it is possible to do a post-mortem of certain events, tracking when they started to gain in popularity and when they faded into obscurity. The data could also be combined with Kleinberg's techniques for identifying bursts of words [9] to identify popular events as they are happening.
It is important to note that while the data was filtered against a list of common stop
words, some constant terms (e.g., said, has) remain. For the purposes of searching, some
3 Normalization is performed by dividing the number by the total number of occurrences, then multiplying by 10,000 to find the frequency per 10,000; e.g., (number of times "said" appears / total number of words) * 10,000.
common terms that are not useful for analysis should be included. When looking for bursts, or when identifying lists of popular words, a two-pass approach would be best: first building the complete list from the database, then recomputing the list while filtering out common words. One approach to filtering would be to track the change per word over time and to exclude words that have a small average change over a large window (e.g., exclude words with an average change of 2 or less over the past two months).
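The proposed second-pass filter could be sketched as follows; the threshold and the data layout are hypothetical:

```python
def stable_words(daily_ranks, max_avg_change=2.0):
    """Words whose average day-to-day rank change is small.

    daily_ranks maps word -> list of daily ranks over the window.
    Such 'constant' words would be filtered out on the second pass.
    """
    stable = set()
    for word, ranks in daily_ranks.items():
        changes = [abs(b - a) for a, b in zip(ranks, ranks[1:])]
        if changes and sum(changes) / len(changes) <= max_avg_change:
            stable.add(word)
    return stable

# 'said' barely moves; 'Schiavo' swings wildly and survives the filter.
ranks = {"said": [1, 1, 2, 2], "Schiavo": [784, 524, 43, 1]}
print(stable_words(ranks))  # {'said'}
```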
              15-Mar                   16-Mar                   17-Mar                   18-Mar
Rank   Word    Freq    +/-      Word    Freq    +/-      Word    Freq    +/-      Word    Freq    +/-
  1    said    56.943  New      said    68.936   0       has     59.085   1       has     59.378   0
  2    has     53.093  New      has     63.208   0       said    59.000  -1       said    50.393   0
  3    about   46.736  New      about   43.713   0       who     43.402   1       about   41.897   1
  4    who     45.393  New      who     39.694   0       about   42.555  -1       who     41.213  -1
  5    were    36.440  New      will    38.488   1       will    39.079   0       will    36.330   0
  6    will    33.754  New      been    35.875   6       would   37.553   7       one     35.939   2
  7    would   33.038  New      were    33.363  -2       out     31.280   5       would   33.791  -1
  8    all     32.769  New      all     31.956   0       one     30.263   3       been    32.619   2
  9    one     32.142  New      more    30.348   2       people  30.093   1       more    31.252   4
 10    people  31.068  New      people  30.147   0       been    30.093  -4       all     30.470   2
 11    more    30.531  New      one     29.645  -2       were    29.754  -4       out     30.275  -4
 12    been    28.740  New      out     29.444   1       all     29.415  -4       were    28.126  -1
 13    out     28.382  New      would   27.333  -6       more    28.737  -4       if      27.247   4
 14    no      27.487  New      up      26.730   3       up      25.431   0       people  25.197  -5
 15    what    27.218  New      Bush    25.123  11       Bush    25.346   0       up      25.099  -1
 16    so      25.069  New      what    23.816  -1       like    23.905  12       can     23.634  14
 17    up      24.263  New      some    23.414   4       if      23.397   2       our     23.341  26
 18    if      23.099  New      Iraq    22.309  33       what    22.464  -2       what    22.071   0
 19    like    23.010  New      if      22.108  -1       some    22.464  -2       some    21.583   0
 20    can     22.920  New      can     21.806   0       into    22.464   7       other   21.583   7

Table 6: Most popular words per day for the Politics bundle
Table 7 shows statistics for the word "Schiavo", including the rank among all words in the Politics bundle for that day, the relative frequency per 10,000 words and the change from the previous day. On the 15th, "Schiavo" was not ranked in the 1,000 most popular words. As the Schiavo case gained in national attention, the frequency exploded, peaking as the most popular word on the 22nd, as the courts and Congress debated her case. Yet her name remained one of the most popular words in the Politics bundle up until her death (and likely beyond). This pattern demonstrates both how a word will jump to the top of the list as a story breaks and also how significant events can be identified and their rise and fall tracked by looking back through the word lists.
Date Rank Freq +/-
15-Mar NA NA NA
16-Mar 784 2.211 New
17-Mar 524 2.967 260
18-Mar 43 15.430 481
19-Mar 28 20.438 15
20-Mar 21 21.504 7
21-Mar 23 23.101 -2
22-Mar 1 83.628 22
25-Mar 6 38.832 -5*
26-Mar 5 38.011 1
27-Mar 9 30.973 -4
28-Mar 17 24.174 -8
29-Mar 16 25.095 1
30-Mar 24 20.818 -8
31-Mar 14 26.166 10

* difference between 22nd and 25th
Table 7: Rank and frequency for the word Schiavo in the Politics bundle
Conclusions
With RAIn, we have created a framework for monitoring and analyzing RSS feeds. It is fairly lightweight, requiring only inexpensive hardware. The design is modular, allowing for the easy replacement of components to either support different functionality or improve performance on a given system. RAIn is a complete system, including feed discovery, retrieval, archiving, indexing and a querying interface. It can be pointed at any site with an RSS feed, will archive the site's RSS, and provides the ability to search for items based on keywords. More complex queries can be performed to generate statistical information, either about a site or a group of sites.
We also described some of the statistical analyses possible using RAIn. These range from
simple metrics such as update frequency to complex analyses of the content of the items
contained in feeds. For analysis, several bundles were defined. Each bundle contains a
handpicked set of RSS feeds representing a particular blogging community. We were
able to generate several different sets of statistical information about the blogs contained in the bundles as well as the other 77,373 blogs in RAIn's database.
Based on our experiences with RAIn, the system proved to be very capable. Inexpensive hardware supported processing more than 200,000 feeds per day. More expensive
hardware, or a cluster of inexpensive hardware, should be capable of processing a
significantly larger number of feeds. Despite claims that there may be more than 50
million blogs worldwide, it is likely that there are significantly fewer that are actively
updated. A larger RAIn installation may be able to compete with sites like Technorati and
Syndic8, monitoring a significant percentage of the active blogs in the world.
Building upon the statistics, more complex analyses are possible. RAIn could easily be
used as the basis of a word monitoring system. Simple word-burst techniques could be
applied to watch for sudden changes in a word, finding significant events as they are happening. In times of crisis, the Internet has proven to be the fastest source of news and
information time and again. By actively monitoring RSS feeds, it may be possible to
become aware of significant events before they hit the mainstream media.
RAIn also accumulates a substantial amount of content from blogs. On top of the
statistical indexing methods currently in use, full-text indexing methods could be applied
to create a blog search engine. Combined with existing search technologies like Lucene,
Nutch or XTF, the content could be easily indexed and searched, providing a very substantial searchable archive of blogs.
RAIn proved to be very capable, monitoring a significant number of feeds on inexpensive hardware. More important, RAIn proved to be very flexible and adaptable. As configured, a large number of statistical analyses can be performed on RAIn's data. However, with RAIn's raw data, virtually any statistical analysis is possible; one need only write the module to perform it.
References
[1] Apache Software Foundation, The. Apache Lucene. May 10, 2005.
[2] Barr, Jeff and Bill Kearney. Syndic8. May 10, 2005.
[3] Lemburg, Marc-André. mxDateTime: Date and Time types for Python. May 10, 2005.
[4] Goodnough, Abby and Maria Newman. Supreme Court Rejects Request to Reinsert Feeding Tube. New York Times, March 24, 2005. May 10, 2005.
[5] Hastings, Kirk and Martin Haye. XTF (eXtensible Text Framework). May 10,
2005.
[6] Jacobsen, Kjetil and Markus Oberhumer. PycURL Home Page. April 6, 2005. May
10, 2005.
[7] Khare, Rohit, Doug Cutting, Kragen Sitaker and Adam Rifkin. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs Technical Report #04-04. May 10, 2005.
[8] Klam, Matthew. Fear and Laptops on the Campaign Trail. New York Times, September 26, 2004. May 10, 2005.
[9] Kleinberg, Jon. Bursty and Hierarchical Structure in Streams. Proceedings of the 8th SIGKDD, July 2002.
[10] Libby, Dan. RSS 0.91 Spec, revision 3. July 10, 1999. May 10, 2005.
[11] Mueller, Martin. The WordHoard Project. April, 2005. May 10, 2005.
[12] Pilgrim, Mark. Universal Feed Parser. May 10, 2005.
[13] PostgreSQL Global Development Group. PostgreSQL: The world's most advanced open source database. May 10, 2005.
[14] Riley, Duncan. Number of blogs now exceeds 50 million worldwide. The Blog
Herald April 14, 2005. May 10, 2005.
[15] Sifry, David et al. Technorati. May 10, 2005.
[16] van Rossum, Guido et al. Python Programming Language. May 10, 2005.
[17] Winer, Dave. RSS 2.0 Specification. January 30, 2005. May 10, 2005.
[18] Winstead Jr., Jim. blo.gs. May 10, 2005.