

    Computer Science Department

    Technical Report

    NWU-CS-05-08

    June 6, 2005

    RAIn: A System for Indexing and Archiving RSS Feeds

    Jeff Cousens and Brian Dennis

    Abstract

Really Simple Syndication, or RSS, provides a way for users to monitor a web site for changes. One of the most popular uses of RSS is to syndicate a web log. RAIn, for RSS Archiver and Indexer, is a system for monitoring and archiving RSS feeds, and for indexing their contents. This report provides a discussion of the design and implementation of RAIn. The report also includes a summary of RAIn's results over a two-week period, illustrating both how a small, low-end system is capable of monitoring a significant number of feeds and the types of statistics RAIn is capable of producing.


Keywords: Really Simple Syndication, RSS Feed Crawler, RSS Feed Statistics, RSS Feed Indexing, Python


    Table of Contents

Introduction
Overview
Designing a FeedCrawler
    Fetchers
        The FeedFetcher
        The CurlFetcher and SharedCurlFetcher
    Feed Newness
    Archiving
    Indexing
    Querying
        The FeedSearcher
        The FeedAnalyzer
    Feed Discovery
    Storage
        Schema
    Design Decisions and Lessons Learned
Experiences Using RAIn
    Experimental Setup
    Analysis
Conclusions
References


    Introduction

    The Internet is changing writing. Fifteen years ago, one had to convince a publisher to

    accept a manuscript before one could become a real author. Publications were limited to

    books, magazines and papers. Writing was something tangible, something that required

effort and overhead to produce and distribute. Then came the World Wide Web. Anyone could create a home page. Companies like Tripod and Geocities enabled everyone to get

    web space for free and publish anything they wanted. There was no longer any editorial

    approval or need to sell copies.

    Recently, interest in publishing on the Web has led to an explosion in the popularity of

    web logs, or blogs. Blogs make it easy to maintain repeatedly updated sites. People now

are creating electronic diaries for the entire world to read. Where once everyone had a home page, now everyone has a blog. Some authors post infrequently and personally,

    while others take their blogs very seriously and professionally. In 2004, blogs played a

    significant role in the US Presidential Election. At both the Democratic and the

Republican National Conventions, bloggers stood beside traditional journalists. Providing real-time coverage via wireless devices, these blogs were the main source of coverage of

    the conventions for many people [8].

The blogging phenomenon is still relatively young. While some sites exist to gather blog statistics, they often keep the information close, only revealing a very small subset of the

    information gathered. There are millions of blogs out there [14], and little is known about

    them. How do people use blogs? What are their posting habits? More interesting are the

stories told by these blogs. What hot news topic is everyone discussing? When did they start?

RAIn was created to help answer these questions. RAIn is a software system capable of collecting information about hundreds of thousands of blogs, allowing us to examine the behavior of a large community of blogs. The system was designed to analyze the

    behavior of communities of feeds, not blogging on the whole. While capable of handling

hundreds of thousands of blogs, it was not designed to compete with sites like Technorati [15] or Syndic8 [2] that attempt to perform exhaustive monitoring of every blog in the

blogosphere, the world of web logs. The system is fairly lightweight, permitting it to be

    run on commodity hardware; modular, enabling components to be changed or extended;

    and flexible enough to be adapted to a wide variety of queries.

    Overview

Really Simple Syndication, or RSS, is a lightweight XML-based method for sharing web

    content. It provides a low-bandwidth way for users to watch a web site for changes. As

blogging has exploded in popularity, so has RSS. All major blogging packages include support for syndication using RSS. RSS comes in two common versions: RSS 0.9x [10]


and RSS 2.0.x [17]. Depending upon the implementation, an RSS feed may contain anything from a list of headlines with brief summaries to the full contents of a blog's articles. A blog's RSS 2.0.1 feed might look like:

    <rss version="2.0">
      <channel>
        <title>Technology at Harvard Law</title>
        <link>http://blogs.law.harvard.edu/tech/</link>
        <description>Internet technology hosted by Berkman Center.</description>
        <pubDate>Tue, 04 Jan 2005 04:00:00 GMT</pubDate>
        <item>
          <title>RSS Usage Skyrockets in the U.S.</title>
          <link>http://blogs.law.harvard.edu/tech/2005/01/04#a821</link>
          <description>Six million Americans get news and information from RSS
          aggregators, according to a nationwide telephone survey conducted by
          the Pew Internet and American Life Project in November.</description>
          <author>Rogers Cadenhead</author>
        </item>
      </channel>
    </rss>

Minimally, an RSS feed is a sequence of loosely structured items. RSS's ease of use and popularity has led to a syndication ecology, where readers monitor a site's RSS feed instead of the site itself.

With RAIn, the goal was to create a system for monitoring, archiving and analyzing RSS feeds. The system is designed to be modular, permitting new components to be added or various components to be changed out. This is true both in the design of the objects and packages and in the way that RAIn uses Python [16], the high-level, interpreted, object-oriented language RAIn was written in. The system is lightweight, requiring only inexpensive hardware to monitor hundreds of thousands of feeds on a daily basis.

    Designing a FeedCrawler

RAIn is a modular system, consisting of an engine to manage feeds to be crawled and modular components to fetch the feeds, archive the results and index the contents, as well as components for answering queries and finding new feeds.


    Figure 1: A diagram of RAIn's architecture

Figure 1 shows RAIn's architecture. The crawler determines which feeds need to be crawled, and creates a fetch thunk, an executable object containing a fetcher, an archiver, an indexer and connections to the Internet and database. The fetch thunks are then put in a crawl pool. As threads become available, fetch thunks in the crawl pool are executed. The fetch thunk thread fetches the feed from the Internet and, if updated, archives and indexes it.
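To make the thunk pattern concrete, the following is a minimal sketch of how such a crawl pool might be wired together; the class and function names (FetchThunk, worker) and the fetcher/archiver/indexer interfaces are illustrative rather than RAIn's own:

    import Queue
    import threading

    class FetchThunk:
        """Bundles everything one crawl needs so a worker can simply call it."""
        def __init__(self, feed_url, fetcher, archiver, indexer, db):
            self.feed_url = feed_url
            self.fetcher = fetcher
            self.archiver = archiver
            self.indexer = indexer
            self.db = db

        def __call__(self):
            result = self.fetcher.fetch(self.feed_url)
            if result.updated:                  # only changed feeds are stored
                self.archiver.archive(result)
                self.indexer.index(result)
            self.db.record_fetch(self.feed_url, result.status)

    def worker(pool):
        while True:
            thunk = pool.get()                  # blocks until a thunk is queued
            thunk()

    pool = Queue.Queue()
    for _ in range(30):                         # 30 worker threads, as in the experiment
        t = threading.Thread(target=worker, args=(pool,))
        t.setDaemon(True)
        t.start()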

Every feed monitored by RAIn is stored in a database table. This table includes a variety of information about the feed, including the URL to check, the time and result of the last check, a status count, the time of the next check and information about the last fetch of the feed, including the HTTP ETag and Last-Modified headers for determining feed newness, if available, and an MD5 digest of the feed. On a user-defined interval, RAIn checks to see if the database contains feeds that have yet to be checked or stale feeds. Feeds are considered stale when the next check time has passed. URLs of feeds to be checked are retrieved by the core module and dispatched to worker threads.

    Fetchers

    A Fetcher handles the HTTP retrieval and processing of a feed. It is responsible for

    determining whether the feed has changed, updating the status of the feed and the feed

    metadata, and archiving and indexing the feed, if necessary. Three different classes of

    Fetchers were created:


    The FeedFetcher

The FeedFetcher is the most basic Fetcher module. In addition to a simple set of Fetcher routines, the FeedFetcher module also contains routines common to all Fetchers. It is written using only stock Python routines and uses Python's urllib2 module to retrieve the feeds. This allows the FeedFetcher module to be used on any system where Python is available without any dependence on non-standard modules that might not be available on all platforms and versions. It contains a reasonable amount of intelligence to attempt to avoid overloading servers. However, the urllib2 handles are not reusable, so every feed crawled requires a new handle to be instantiated.
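As a rough sketch of what a urllib2-based retrieval looks like (the function name, return shape and User-Agent string are assumptions, not the FeedFetcher's actual interface):

    import urllib2

    def fetch(url):
        # urllib2 handles are not reusable, so each fetch builds a new one
        request = urllib2.Request(url, headers={'User-Agent': 'RAIn'})
        response = urllib2.urlopen(request)
        body = response.read()
        headers = response.info()       # ETag, Last-Modified, etc.
        return body, headers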

    The CurlFetcher and SharedCurlFetcher

The CurlFetcher is an enhanced Fetcher module. It uses pycurl [6], a Python wrapper to libcurl. libcurl is a highly optimized C library for network operations. It implements features like caching the results of Domain Name System (DNS) queries and reusable handles, allowing for improved performance over Python's urllib2. The CurlFetcher can be used in two different ways: per fetch (CurlFetcher) or per thread (SharedCurlFetcher). Per fetch, the CurlFetcher is similar to the FeedFetcher in that every feed crawled requires a new handle to be instantiated. Per thread, the SharedCurlFetcher creates one handle per thread when the FeedCrawler module is started. This saves the Fetcher the overhead of having to instantiate a new pycurl handle every time a new feed is fetched. This also allows the pycurl handle to cache DNS information across fetches, reducing network overhead.
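A sketch of the per-thread handle reuse that the SharedCurlFetcher provides might look like the following; the use of threading.local here is illustrative (the report's version created its handles when the FeedCrawler started), but the pycurl calls are the standard ones:

    import threading
    import StringIO
    import pycurl

    _local = threading.local()

    def fetch(url):
        if not hasattr(_local, 'curl'):
            _local.curl = pycurl.Curl()             # one reusable handle per thread
        buf = StringIO.StringIO()
        _local.curl.setopt(pycurl.URL, url)
        _local.curl.setopt(pycurl.WRITEFUNCTION, buf.write)
        _local.curl.perform()                       # DNS cache survives across fetches
        return buf.getvalue()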

The performance differences between the FeedFetcher, CurlFetcher and SharedCurlFetcher were only briefly examined. As might be expected, the SharedCurlFetcher had the best performance in terms of number of feeds fetched per minute. There was not an obvious winner between the FeedFetcher and the CurlFetcher.

    Feed Newness

Once a feed is retrieved, all Fetchers check to see if the feed has changed using several different metrics. If the HTTP headers contain an ETag field and the ETag field matches the previous ETag, or if the header contains a Last-Modified field and the Last-Modified field matches the previous Last-Modified, the feed is considered unchanged. If the feed is not found to be unchanged by ETag or Last-Modified header, an MD5 digest is taken of the entire feed. This is compared against a previous MD5 digest. If the digests are different, the feed is considered updated. Updated feeds are archived, both in their raw format for potential future analysis and as individual items.
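Condensed into code, the newness test amounts to something like the following sketch, where `previous` stands for the values saved in the feed's database row from the last fetch (the function and key names are illustrative):

    import md5      # Python 2.3-era module; hashlib.md5 in later versions

    def feed_is_updated(headers, body, previous):
        etag = headers.get('ETag')
        if etag and etag == previous['http_etag']:
            return False                        # unchanged, by ETag
        modified = headers.get('Last-Modified')
        if modified and modified == previous['http_last_modified']:
            return False                        # unchanged, by Last-Modified
        digest = md5.new(body).hexdigest()
        return digest != previous['fetch_digest']   # changed iff digests differ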


Frequency of fetching is adaptive, in an attempt to match the feed's change frequency. The next fetch time is determined by adding a fetch interval to the time the current fetch was performed. All feeds begin with an interval of 1 hour, which is then modified based upon the result of the current fetch. If the feed has changed since the last check, the fetch interval is reduced by 2 hours, to a minimum of 1 hour. If the feed was unchanged, or there was an error fetching the feed, the fetch interval is increased by 4 hours, to a maximum of 24 hours. The number of errors that occur when fetching a feed is recorded; after too many consecutive failures, feeds are marked as removed and no longer checked.
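The interval-adjustment rule can be written directly; all of the constants below come from the text, while the function name is an illustrative stand-in:

    HOUR = 3600
    MIN_INTERVAL = 1 * HOUR                 # every feed starts here
    MAX_INTERVAL = 24 * HOUR

    def next_interval(interval, changed, error):
        if changed and not error:
            return max(interval - 2 * HOUR, MIN_INTERVAL)   # check sooner
        return min(interval + 4 * HOUR, MAX_INTERVAL)       # back off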

Adapting to a feed's frequency of change helps RAIn discover new items as they are posted without overloading a server by repeatedly checking for new items when a feed is unchanged. A rudimentary analysis showed that adaptive fetching worked: the fetch interval grew larger for infrequently updated feeds while it stayed small for frequently updated feeds. However, a more in-depth analysis over a longer time period would be necessary to determine how effective RAIn's current implementation is.

    Archiving

When a feed is determined to be new, the raw feed is archived in a database table. The feed is compressed using Python's zlib module. This is important as feeds, being text, compress to somewhere between 5 and 10% of their original size. The compressed feed is then inserted into a binary database field, along with the date that the feed was archived. This raw feed can then be retrieved and uncompressed for later analysis.

The feed is also parsed using Mark Pilgrim's Universal Feed Parser [12] into individual entries in the feed. The items are then individually checked against the database using an MD5 digest to filter out items that have already been processed. New items are serialized using Python's pickle module, compressed using Python's zlib module and stored in a binary database field. These items can be retrieved, uncompressed and unserialized to access the raw contents of the item, including the full text of the item.
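In outline, the archiving path looks something like this sketch; the SQL layer is elided, and the exact bytes hashed for the item digest are an assumption (the report does not say):

    import md5
    import pickle
    import zlib

    def archive_feed(raw_feed):
        # text feeds typically compress to 5-10% of their original size
        return zlib.compress(raw_feed)

    def archive_item(item):
        digest = md5.new(repr(item)).hexdigest()    # duplicate-filter key
        blob = zlib.compress(pickle.dumps(item))    # serialized, then compressed
        return digest, blob

    def restore_item(blob):
        # reverse the process to recover the full item, text and all
        return pickle.loads(zlib.decompress(blob))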

Information about the items contained within the feed is stored, including the URL of the item, the time that the item was posted, if available, the time that the item was archived and an MD5 digest of the item for later comparison. This information can be used to look for items from a certain site or in a certain date range. It can also be used to retrieve the item referenced in the feed from the feed's web site.

    Indexing

In order to facilitate querying and analyzing the content of the feeds, the words in the feeds are indexed. This is a daunting task. Assuming the crawler is able to retrieve 200,000 feeds per day, that only 50% of these feeds contain new items and that the average item contains 20 words, the crawler will store 2,000,000 words per day. At that


pace, in less than two months the crawler will have indexed more words than are contained in the British National Corpus. In practice, the numbers are significantly higher.

The design for indexing is based upon the method used by the WordHoard project at Northwestern University [11]. One aspect of WordHoard is an interface that allows literary scholars to search a corpus of works for words and to generate statistics, including frequency and counts. This is interesting to literary scholars as it can reveal patterns in an author's works as well as trends in literature on the whole. In WordHoard, works of literature are parsed into individual words. The words are stored in two different tables: a table of individual lemmas, for linguistic analysis across a corpus, and a table of word occurrences, for analyzing specific instances of a word's usage. The word occurrence table stores complete information about every word and punctuation mark in every work in a corpus and, with some meta-information such as speakers and act/scene or page, may be used to reconstruct the entire work, word for word.

Using WordHoard as a model, RAIn splits an RSS feed into individual items, then parses each item into individual words. Here RAIn departs slightly from WordHoard's word occurrence model. With RAIn retrieving more than 2,000,000 new words per day, storing complete word occurrence information for any significant time interval would consume an unreasonable amount of storage. As a compromise, sentences are stripped of punctuation and filtered against a list of common stop words. The remaining words are then aggregated within an item and the distinct filtered words and their counts are stored in a database.
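A minimal sketch of this aggregation step follows; the tokenizer and the stop list shown here are illustrative stand-ins for RAIn's own:

    import re

    STOP_WORDS = set(['the', 'a', 'an', 'and', 'of', 'to', 'in'])  # sample only

    def index_item(text):
        # strip punctuation, lowercase, and split on whitespace
        words = re.sub(r'[^\w\s]', ' ', text.lower()).split()
        counts = {}
        for word in words:
            if word not in STOP_WORDS:
                counts[word] = counts.get(word, 0) + 1
        return counts       # one database row per distinct word and its count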

While words can tell a story with what they say, URLs define relationships. They show how posts relate to other posts and how sites relate to other sites. Special attention is paid to URLs in order to track these relationships. When indexing, RAIn looks for common URL patterns and stores them in a separate table. For this purpose, a URL is considered to be either a string containing the pattern http:// or the contents of an HTML anchor element's href attribute.
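In code, the two patterns the text names might be captured with regular expressions along these lines (the exact expressions are illustrative):

    import re

    # the href attribute of an HTML anchor element
    HREF_RE = re.compile(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', re.IGNORECASE)
    # any bare string containing the pattern http://
    HTTP_RE = re.compile(r'http://[^\s"\'<>]+')

    def extract_urls(body):
        urls = set(HREF_RE.findall(body))
        urls.update(HTTP_RE.findall(body))
        return urls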

Using the information in these tables, many different types of queries are possible. One family of queries is that providing general statistical information: e.g., how long is the average item, or what is the average ratio of URLs to words. Blogging is still young and not much is known about the posting habits of bloggers. Are more people verbose but infrequent posters, or brief but frequent posters? Does time of day impact posting? How about day of week? These statistics begin to paint a picture.

With the right constraints, these statistics can even tell a story. For example, by monitoring a collection of political blogs before, during and after a keystone event (e.g., a party's national convention or the State of the Union), one might be able to tell whether the event was motivating, discouraging, or even mostly ignored.

More interesting are informational queries: e.g., how did the frequency of a word change over a given period of time, or what are the most commonly used words. A political


scientist might wonder how the usage of "Schiavo"¹ changed during the period from February through April of 2005. When did people start posting about her? When did they stop? A linguist might wonder what the most commonly used words are, and how this changes over time. Where 18th century activists wrote books, many 21st century activists write blogs. Instead of being stored on paper, the snapshot of society today's authors provide us with is online.

    Many blogs now make RSS feeds available for comments as well. By monitoring

    comments, one can gauge reader response to certain topics. Which posts engendered the

    most comments? Which were largely ignored?

    Querying

In order to facilitate analysis, a few different Python modules were created. These modules use two different approaches. The first is to provide a very generalized module, which can accommodate a wide variety of queries based upon RAIn's indexing. The second is to provide a very specific module, only capable of providing a focused set of information but able to generate it in an optimized way.

    The FeedSearcher

The FeedSearcher module provides a generalized interface to the RAIn database. It allows someone to retrieve a list of items, words or URLs based on a series of constraints, including date, word count, a set of feeds to search and a pattern, either an exact pattern, a substring or a regular expression. It even allows someone to limit the number of results, to specify an offset and to get more information about a result, tying a word or URL to an item, and an item to a feed. The following is an example of using the FeedSearcher to find the number of updates per day for a single day:

    import FeedSearcher

    fs = FeedSearcher.FeedSearcher('localhost', 'db', 'user', 'pass')
    fs.type = 'entries'
    fs.start_date = '2005-03-15'
    fs.end_date = '2005-03-15'
    results = {}
    for item in fs.execute():
        for details in fs.getDetails('entries', item['feed_id']):
            if results.has_key(details['feed_url']):
                results[details['feed_url']] += 1
            else:
                results[details['feed_url']] = 1

¹ Terri Schiavo was a Florida woman in a persistent vegetative state whose right-to-life vs. right-to-die case became national news in March 2005 [4].


This generates a Python hash table, or dictionary, using the feed URLs as keys with the number of times the feed was updated that day as values. Yet in order to achieve the flexibility of the FeedSearcher, the interface is generic. The FeedSearcher's execute method does not return enough information, so the getDetails method must be invoked on every item returned by execute. Each call to getDetails involves a SQL query. Thus, in order to compute this statistic, the total number of SQL queries involved is the number of items plus one. For any large data set or complex query, a more optimized interface is desired.

    The FeedAnalyzer

The FeedAnalyzer provides a very specific interface to the RAIn database. It is only capable of performing a fixed set of queries, but it performs them well and does so much better than the FeedSearcher could. The FeedAnalyzer was written for this report, and generated the statistics presented in the empirical study. It was designed from the top down, first looking at the information to present, then determining what queries were necessary to generate that information. The queries were optimized for the task, which provided faster query execution times and reduced the overall number of queries, while the results were returned in the format required for importing into Excel for analysis. The following is an example of using the FeedAnalyzer to find the number of updates per day for a range of days:

    import FeedAnalyzer
    import mx.DateTime

    fa = FeedAnalyzer.FeedAnalyzer(
        'localhost', 'db', 'user', 'pass',
        mx.DateTime.ISO.ParseDateTime('2005-03-15'),
        mx.DateTime.ISO.ParseDateTime('2005-03-31'))
    result = fa.updatesPerDay()

As with the FeedSearcher, this also generates a Python dictionary using the feed URLs as keys with the number of times the feed was updated that day as values. However, the FeedAnalyzer generates this information using both fewer lines of Python and only one SQL query.

    Feed Discovery

For some data sets, it is necessary to analyze a specific collection of feeds. However, sometimes all that is desired is a large collection of feeds. The FeedFinder module is designed to find new feeds to crawl. It is capable of visiting a blog tracking web site, such as blo.gs [18], and retrieving a list of RSS feed URLs. This list is then parsed by the FeedFinder and checked against the database (and itself) for duplicate feed URLs. New feeds are then added to the database for the FeedCrawler to crawl. This module was integrated into the FeedCrawler, although it may also be run independently.
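The duplicate filtering itself is straightforward; a sketch, with illustrative names:

    def filter_new_feeds(candidate_urls, known_urls):
        seen = set(known_urls)          # URLs already in the webfeeds table
        fresh = []
        for url in candidate_urls:
            if url not in seen:
                seen.add(url)           # also catches duplicates within the batch
                fresh.append(url)
        return fresh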


    Storage

RAIn currently stores all of its data in a relational database. It leverages Python's DB-API, allowing the database to be fairly easily swapped out. Modifications are necessary only when the schema changes due to differences in data types (e.g., PostgreSQL's bytea vs. MySQL's longblob) or to accommodate differences in modules' support of data types (e.g., pyPgSQL's PgSQL.PgBytea versus psycopg's Binary).

The database for RAIn is PostgreSQL [13]. Initially, pyPgSQL was used as the Python DB-API interface to PostgreSQL, and PostgreSQL performed well. However, as the database grew in size, performance fell off. At one point, RAIn was only processing thousands of feeds per day. To improve performance, the database interface module was switched from pyPgSQL to psycopg. While both modules provide a DB-API 2.0-compliant interface, psycopg was designed to be much faster than other modules. One important difference between pyPgSQL and psycopg is a bug with psycopg 1 and Unicode characters. This required a workaround to handle potential Unicode data. In addition, psycopg required more attention to be paid to transactions so updates would be available across all database handles.

    Schema

RAIn's database consists of six tables:

webfeeds: This table contains information about the RSS feeds being monitored by RAIn. It is updated every time a feed is crawled with information about the fetch operation. If the feed is changed, http_etag, http_last_modified, fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count, fetch_interval and fetch_digest are updated. If the feed has not changed since the last fetch, only fetch_last_attempt, fetch_next_attempt, fetch_status, fetch_status_count and fetch_interval are updated, as the ETag, Last-Modified and MD5 digest will be unchanged.

webfeeds_archive: This table contains a zlib-compressed archive of every new RSS feed RAIn fetches, stored as a binary object (e.g., PostgreSQL's bytea, MySQL's longblob) in data_bytes. The date that the feed was archived is stored in data_archived. This allows the data to be analyzed using a different methodology at a later date.

webfeed_items: This table contains information about the individual items in the feeds fetched by RAIn.

webfeed_item_words: This table contains all of the words found by the FeedIndexer. The count for each word per item is stored in word_count. To facilitate querying, a ts_vector for the word is stored in index_word.


webfeed_item_urls: This table contains all of the URLs found by the FeedIndexer. To facilitate querying, a ts_vector for the URL is stored in index_url.

webfeed_bundles: This table contains information about bundles of RSS feeds, representing a many-to-one relationship between a group of feeds and a bundle name. This relationship allows groups of feeds to be tied together into a single bundle for analysis.

Figure 2: An entity-relationship diagram for RAIn's database


In addition to feed information stored in the database, RAIn also stores low-level information in log files. The logging level may be configured from critical, logging only events that would prevent RAIn from running, to debug, logging almost every operation RAIn performs.

    Design Decisions and Lessons Learned

Storage: By itself, an RSS feed does not represent a significant amount of data; typically only a couple of KB. However, when processing 200,000 feeds per day, 75% of which are updated at least once a day, that couple of KB adds up very quickly. Space rapidly became a concern during development and we had to make several changes as a result.

Most important was the implementation of accurate duplicate elimination. The initial design did not fully incorporate MD5 checksums. This was improved so that both the feed itself and the items within the feed are now checksummed. The feed is checked against the last feed to see if it has changed, while the item is checked against all items from that feed to ensure that it has not already been processed. Adding both of these checksums made a significant difference: feed checksums doubled the number of feeds marked as unchanged, while item checksums reduced the number of items processed by more than 75%. Not only did these changes directly correspond to savings in storage, but they also allowed significant increases in the number of feeds crawled per day. However, even with these reductions, the amount of data stored is still very significant.

Indexing: While indexing is a very simple process to implement, it is very difficult to make it work well. The average post contains 350 distinct words. The architecture of the word index, while very useful for statistical analyses, requires that each of these words be its own record. This means that every feed crawled requires an average of 355 database inserts. This is a very significant amount of database I/O and comes at a non-negligible cost.

Stop wording is one approach to reducing the amount of data. While initially a basic set of stop words was used, during analysis it was discovered that additional stop words are necessary. Depending upon the data set being analyzed, it may be necessary to analyze a few weeks' worth of data to get an adequate feel for which words are important and which are not. It is also important to consider the goal in using stop words. Some common stop words may actually be useful for answering certain questions; e.g., to analyze whether posts about men or women are more common, it would be desirable to have he/she and his/hers in the database.

The size of the database also impacts performance. At 350 words per item and 100,000 new items per day, the words table would accumulate more than 1 billion words in less than a month. Even with stop words and duplicate elimination, the size of the words table is very substantial. This poses its own set of concerns when analyzing or updating the database. It places constraints on database performance, file system usage and even database design and index usage. As the database grows, queries take longer to process.


    Past a certain point, queries may no longer be performed interactively. Certain types of

    queries eventually become impossible.

One solution to this problem is to change the database design. Currently, all of the words are stored in a single table. Switching to a design where a new table is created every day would limit the size of the individual tables, keeping search times interactive. Some data sets (e.g., most popular words) could be precomputed and the results stored in a separate, much smaller table.

Another approach would be only to store frequency information in the database and not to maintain a full-text index. The items could be stored as XML and indexed using a different mechanism, such as Lucene [1], Nutch [7] or XTF [5].

Database Design: Relational databases can provide a great deal of power and allow you to perform some very complicated queries very easily. However, as the size of the database increases, greater attention must be paid to the design of the database and how it impacts performance.

The initial design of the stale feeds query used a not-equals constraint to filter removed feeds:

    SELECT feed_url, http_etag, http_last_modified,
           fetch_next_attempt, fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt AND fetch_status != 16
    ORDER BY fetch_next_attempt LIMIT 200;

PostgreSQL executed this constraint as a sequence scan, linear with the number of records, despite the presence of an index on the feed status. This was not a problem with a small database, but, as the number of feeds monitored increased, so did the time it took for the stale feeds query to execute. This had an impact on the crawler's performance, as a significant amount of processor time and database I/O was lost waiting for this query to finish. To improve performance, the feed removal process was redesigned. Feeds continue to be marked as removed, but the next fetch time is set for 1,000 years in the future. This allows use of a simple date filter, as it is unlikely that RAIn will be used with a current date of 3005. Since the next fetch field had previously been left untouched once a feed was removed (providing a timestamp for when the feed was removed), a new field was added to record when a feed is removed.
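The report does not print the revised query, but given the redesign described above, the stale-feeds query reduces to a plain date filter, roughly:

    SELECT feed_url, http_etag, http_last_modified,
           fetch_next_attempt, fetch_interval, fetch_digest
    FROM webfeeds
    WHERE now() >= fetch_next_attempt
    ORDER BY fetch_next_attempt LIMIT 200;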

Relational databases allow the creation of constraints on data, enforcing a schema and ensuring data integrity. This protection comes at a cost. The initial design included referential integrity constraints connecting the various tables, ensuring that a webfeed_item and webfeed_archive had a corresponding webfeed, and that a webfeed_item_url and webfeed_item_word had a corresponding webfeed item. At one point in development, it was necessary to perform some deletions from the database to remove duplicate feeds. With referential integrity constraints present, the database needed


to perform complex joins and scans in order to process the deletion. Even with indexes in place, the queries took a significant amount of time. At that point, the decision to include referential integrity constraints was reevaluated and the constraints were removed, instead trusting RAIn to accurately insert data.

Data Format and Encoding: One problem when crawling a disparate set of web data is that it comes in a wide variety of formats. This makes it difficult to predict what data format a feed will use. Two places this caused problems were in the character encoding and timestamp format. Most RSS feeds use either ASCII, ISO-8859-1 or UTF-8 encoding, but some use other encodings. Similarly, most HTTP servers use a standard timestamp format that can be handled by the eGenix mxDateTime library [3], but some use uncommon formats not handled by mxDateTime; e.g., 2005-04-28T12:13:49Z. With feed finding enabled, it is likely that the database will eventually include some feeds using unhandled formats. As a result, careful attention must be paid to exception handling. On the data set RAIn was tested with, this happens about 0.005% of the time.

    Experiences Using RAIn

A 16-day period at the end of March was selected to empirically investigate RAIn's capabilities. Several hand-selected groups of RSS feeds were added to the database and monitored, along with approximately 77,000 existing feeds. When the period ended, the data from those feeds was exported and analyzed for several different statistics.

    Experimental Setup

For this experiment, RAIn was run on a dual Pentium III at 933 MHz with 1 GB RAM running Debian GNU/Linux testing (sarge) with the Debian versions of PostgreSQL (7.4.7-2), Python (2.3.5-1), psycopg (1.1.18-1), pycurl (7.13.0-1) and libcurl3 (7.13.1-2). The OS and crawler were stored on an internal Ultra-160 SCSI disk. The database was stored on an external Ultra Wide SCSI attached RAID5 ATA-100 disk array. The server was connected to the Internet via a 100Mb Ethernet connection and configured to use a local installation of the DeleGate proxy server (8.9.6) with caching enabled.

In order for analysis to produce meaningful results, the feeds analyzed should be representative of something. To that end, six bundles of feeds were defined for analysis, totaling 723 feeds. These bundles are Computers & Technology (150 feeds), Entertainment (132 feeds), Eszter Politics (107 feeds), Politics (232 feeds), Sports (52 feeds) and Subscriptions A-list (50 feeds). Four of these (Computers & Technology, Entertainment, Politics and Sports) are the feeds from the Weblogs categories of Yahoo's Directory. Eszter Politics is a handpicked collection of political feeds from a colleague in Northwestern's School of Communications who is studying political blogging. Subscriptions A-list is the intersection of several different "top feeds" lists. This allows us


to examine the behavior of specific blogging communities (e.g., the political community or the entertainment community) as well as to compare and contrast different communities.

The database was not purged before data gathering. In addition to the 723 bundled feeds, there were 77,373 feeds that were not part of any bundle. These feeds were obtained from several different sources, including the list of recently updated feeds on blo.gs and the list of syndicated feeds on Syndic8, two web sites providing centralized monitoring of over 10 million blogs. Some of the statistics analyzed are representative of the database as a whole, including both bundled and unbundled feeds. The unbundled feeds are not categorized in any way, other than having been on lists of blogs that were validated and updated recently. As such, they are representative of blogging on the whole and not any particular field. This is certainly a useful collection of feeds to consider. However, sites such as Technorati monitor 10 million blogs and the number of blogs in existence has been estimated to be over 50 million [14], so this is a very small slice of all blogs.

Analysis was performed while RAIn was still running, using the live database. For some queries, the live tables could be used. However, for others, especially those involving words, the live tables are too large to analyze in a reasonable amount of time. For the purpose of analysis, all of the data for the window being analyzed was exported into separate tables. This reduced the size of the words tables by more than 99%, making analyses possible in a reasonable amount of time. Even with these separate, smaller tables, some of the queries took more than 10 minutes to perform, while the queries to create these tables took several hours to complete.

For the duration of our data gathering, RAIn was checking for stale feeds every 60 seconds and looking for a maximum of 400 stale feeds. 30 threads were available in the thread pool. The hardware was capable of supporting a higher number of threads, and thus processing a larger number of stale feeds per minute, but the number of threads was intentionally kept low to facilitate simultaneous crawling and querying. RAIn was constantly busy on a feed set of 78,096 feeds, making approximately 235,000 feed visits per day.

    Analysis

We selected March 15th through March 31st, 2005, as our window to analyze. Unfortunately, there was an unexplained glitch and the kernel killed the crawler early on March 23rd. The crawler was restarted on the 25th, but there was some information missed as a result of this failure, and certain aspects of the results are atypical. This has been indicated where it affects the results.


    Figure 3: Feed status per day

    Figure 4: Disk usage per day

    Figure 3 shows the performance of the crawler in terms of how many feeds the crawler

    was able to visit each day, along with a breakdown of how the feeds were classified.


Figure 4 shows the size, in KB, that the database grew each day storing the information about these feeds. It is important to note that these numbers pertain only to feeds, but database usage increases in proportion to both the number of updated feeds and the number of new items in those feeds. The number of new items per feed was unusually high on March 25th due to the performance problems on the previous days; thus the disk-per-feed ratio on that day is not representative of typical usage.

These numbers were obtained from RAIn's logs and are representative of the performance of the system as a whole, including both bundled and unbundled feeds. From them, one can obtain a feel for RAIn's performance. From the total number of feeds, one can estimate that every feed in the database was visited approximately 3 times per day. The high numbers of updated feeds, coupled with RAIn's constant activity, hint that RAIn was not able to catch updates as they happened, instead catching them hours after they occurred. The high numbers of unchanged feeds hint that the cap of 24 hours for the fetch interval may need to be increased.

These numbers also highlight the glitch that resulted in the crawler being killed by the kernel, as seen by a marked difference in performance beginning on March 21st.

    Figure 5: The number of updates per day per bundle


Figure 5 shows the average number of updates per day per bundle, normalized against the number of feeds in the bundle.² One interesting result is that the Subscriptions A-list bundle has a much higher number of posts per day than the other bundles. Membership in the Subscriptions A-list bundle is based roughly on popularity, not number of updates. From this, it is possible to conclude that popular blogs are updated more frequently than other blogs. Another surprising result is that the number of posts per day for the Entertainment bundle is so low. In this case, it can be concluded that either entertainment blogs do not post all that frequently or the Entertainment bundle contained blogs that were not being updated during the analysis window. All of the numbers are slightly low due to the complications around the 24th, but they still demonstrate the relative frequencies between bundles.

    Figure 6: The number of updates per day of week per bundle

Figure 6 shows the average number of updates per bundle against the day of the week, normalized against the number of feeds in the bundle.² Again, the numbers are slightly low due to the complications around the 24th. Unlike the previous graph, the relative frequencies are affected by the complications around the 24th: the numbers for Saturday and Sunday are unaffected, the numbers for Monday and Friday only slightly affected, and the numbers for Tuesday, Wednesday and Thursday significantly affected. Looking at the frequencies, it appears that there may be an interesting, if perhaps predictable, story about updates and day of week that should be reexamined.

² Normalization is performed by dividing the number of updates by the number of feeds in the bundle.


    Figure 7: Date vs. time vs. frequency showing only the Sports bundle

Figure 8: Date vs. time vs. frequency showing only the Subscriptions A-list bundle


Figures 7 and 8 show the post density against date and time. The size of a bubble is representative of the number of posts in a ten-minute window. These graphs give an idea of the posting habits of a bundle. On the whole, posting is steady, but there are definite peaks and valleys for some bundles. For example, post density for the Subscriptions A-list bundle is higher during the period from 11 a.m. to 7 p.m. and lower during the period from 12 a.m. to 6 a.m. By contrast, the Sports bundle is very scattered, with very significant peaks throughout the day separated by large valleys. This window demonstrates fairly typical posting frequency. Deviations from the norm could be used to discover or pinpoint significant events. This information could be correlated with other information (e.g., popular words) to track the rise and fall of significant media events (e.g., the 2004 tsunami, the death of the pope). As with the previous graphs, these show some unusual behavior around the 24th, both sporadic behavior starting on the 21st and an increased density on the 25th.

    Bundle                   Words per Body        Bundle                   URLs per Body
    Computers & Technology           74.917        Computers & Technology           1.018
    Entertainment                   128.555        Entertainment                    1.687
    Eszter Politics                  79.919        Eszter Politics                  1.125
    Politics                         95.482        Politics                         0.658
    Sports                           67.006        Sports                           0.383
    Subscriptions A-list             57.980        Subscriptions A-list             1.051

Tables 2 & 3: Words per body (left) and URLs per body (right)

Tables 2 and 3 report average statistics for each bundle in terms of the length of items and the number of URLs mentioned in the body. From these, it can be observed that the entertainment blogs observed are likely to be lengthy and contain links, while sports blogs are likely to contain short posts without links. However, it is important to note that some blogging packages limit the length of RSS feeds, which may have affected these numbers.

    Bundle                   Words per Body    Factor
    Computers & Technology          157.872     2.107
    Entertainment                   314.966     2.450
    Eszter Politics                 287.090     3.592
    Politics                        316.392     3.314
    Sports                          581.732     8.682
    Subscriptions A-list            230.321     3.972

Table 4: Words per body containing URLs

Table 4 shows the average number of words per body when the item contains one or more URLs. The third column contains the factor between items containing URLs and


items that do not. In all cases, the average number of words is significantly higher when an item contains a URL than when it does not, almost nine times higher in the case of sports blogs. By comparing this information to the information in tables 2 and 3, we can conclude that a post that contains a URL is likely to contain multiple URLs. This shows that URLs are uncommon in almost all cases, appearing in somewhere between 10% and 30% of items, on average.

    Bundle                       Local    In-Bundle    Out-of-Bundle
    Computers & Technology   1,690.536    1,177.572        7,131.892
    Entertainment              752.025      678.750        8,569.225
    Eszter Politics            687.449    1,259.742        8,052.809
    Politics                 1,100.242      780.478        8,119.281
    Sports                     638.540      564.424        8,797.035
    Subscriptions A-list       523.641      691.602        8,784.757

Table 5: Number of URLs by type

Table 5 is an analysis of the webfeed_item_urls table, examining what people link to. Local links are either absolute links, pointing to the blog's host, in the case of known blog hosts, or the blog's domain, in the case of non-blog hosts, or relative links (i.e., links that do not contain http://). Blog hosts for this experiment were typepad.com, blogs.com, blogspot.com and blogdrive.com. In a small number of cases, these links are also non-http links (e.g., mailto). In-bundle links are http links pointing to other blogs contained in the bundle. Out-of-bundle links are all other http links. To facilitate comparison, all numbers have been normalized per 10,000 URLs.³ These numbers show a large degree of similarity across bundles, though there is a significant difference between Computers & Technology and the rest. From these numbers we can conclude that there is a difference in the behavior of technology blogs and others in terms of linking behavior. Also interesting is the significant difference in ratio between local links and in-bundle links in Eszter Politics and the rest of the bundles. This tells us that the Eszter Politics bundle is a tightly connected bundle, collecting blogs that relate to each other.

Table 6 shows a small sampling of the words used by posts in the Politics bundle over a four-day period, including their relative frequency per 10,000 words³ and the change from the previous day. When viewed over a large window, it is possible to do a post-mortem of certain events, tracking when they started to gain in popularity and when they faded into obscurity. The data could also be combined with Kleinberg's techniques for identifying bursts of words [9] to identify popular events as they are happening.

It is important to note that while the data was filtered against a list of common stop words, some constant terms (e.g., "said", "has") remain. For the purposes of searching, some

³ Normalization is performed by taking the number and dividing by the total number of occurrences, then multiplying by 10,000 to find the frequency per 10,000 items; e.g., number of times "said" appears / total number of words * 10,000.


common terms should be included that are not useful for analysis. When looking for bursts, or when identifying lists of popular words, a two-pass approach would be best, first building the complete list from the database, then recomputing the list filtering out common words. One approach for filtering would be to track the change per word over time and to exclude words that have a small average change over a large window (e.g., exclude words with an average change of 2 or less over the past two months).

    15-May 16-May 17-May 18-May

    Rank Word Freq +/- Word Freq +/- Word Freq +/- Word Freq +/-

    1 said 56.943 New said 68.936 0 has 59.085 1 has 59.378 0

    2 has 53.093 New has 63.208 0 said 59.000 -1 said 50.393 0

    3 about 46.736 New about 43.713 0 who 43.402 1 about 41.897 1

    4 who 45.393 New who 39.694 0 about 42.555 -1 who 41.213 -1

    5 were 36.440 New will 38.488 1 will 39.079 0 will 36.330 0

    6 will 33.754 New been 35.875 6 would 37.553 7 one 35.939 2

    7 would 33.038 New were 33.363 -2 out 31.280 5 would 33.791 -1

    8 all 32.769 New all 31.956 0 one 30.263 3 been 32.619 2

    9 one 32.142 New more 30.348 2 people 30.093 1 more 31.252 4

    10 people 31.068 New people 30.147 0 been 30.093 -4 all 30.470 2

    11 more 30.531 New one 29.645 -2 were 29.754 -4 out 30.275 -4

    12 been 28.740 New out 29.444 1 all 29.415 -4 were 28.126 -1

    13 out 28.382 New would 27.333 -6 more 28.737 -4 if 27.247 4

    14 no 27.487 New up 26.730 3 up 25.431 0 people 25.197 -5

    15 what 27.218 New Bush 25.123 11 Bush 25.346 0 up 25.099 -1

    16 so 25.069 New what 23.816 -1 like 23.905 12 can 23.634 14

17 up 24.263 New some 23.414 4 if 23.397 2 our 23.341 26

18 if 23.099 New Iraq 22.309 33 what 22.464 -2 what 22.071 0

    19 like 23.010 New if 22.108 -1 some 22.464 -2 some 21.583 0

    20 can 22.920 New can 21.806 0 into 22.464 7 other 21.583 7

    Table 6: Most popular words per day for the Politics bundle

Table 7 shows statistics for the word "Schiavo", including the rank among all words in the Politics bundle for that day, the relative frequency per 10,000 words and the change from the previous day. On the 15th, "Schiavo" was not ranked in the 1,000 most popular words. As the Schiavo case gained in national attention, the frequency exploded, peaking as the most popular word on the 22nd, as the courts and Congress debated her case. Yet her name remains one of the most popular words in the Politics bundle up until her death (and likely beyond). This pattern demonstrates both how a word will jump to the top of the list as a story breaks and also how significant events can be identified and their rise and fall tracked by looking back through the word lists.


    Date Rank Freq +/-

    15-Mar NA NA NA

    16-Mar 784 2.211 New

    17-Mar 524 2.967 260

    18-Mar 43 15.430 481

    19-Mar 28 20.438 15

    20-Mar 21 21.504 7

    21-Mar 23 23.101 -2

    22-Mar 1 83.628 22

    25-Mar 6 38.832 -5*

    26-Mar 5 38.011 1

    27-Mar 9 30.973 -4

    28-Mar 17 24.174 -8

    29-Mar 16 25.095 1

    30-Mar 24 20.818 -8

31-Mar 14 26.166 10

* difference between the 22nd and 25th

Table 7: Rank and frequency for the word "Schiavo" in the Politics bundle

    Conclusions

With RAIn, we have created a framework for monitoring and analyzing RSS feeds. It is fairly lightweight, requiring only inexpensive hardware. The design is modular, allowing for the easy replacement of components to either support different functionality or improve performance on a given system. RAIn is a complete system, including feed discovery, retrieval, archiving, indexing and a querying interface. It can be pointed at any site with an RSS feed and will enable archiving of the site's RSS and provides the ability to search for items based on keywords. More complex queries can be performed to generate statistical information, either about a site or a group of sites.

We also described some of the statistical analyses possible using RAIn. These range from simple metrics such as update frequency to complex analyses of the content of the items contained in feeds. For analysis, several bundles were defined. Each bundle contains a handpicked set of RSS feeds representing a particular blogging community. We were able to generate several different sets of statistical information about the blogs contained in the bundles as well as the other 77,373 blogs in RAIn's database.

Based on our experiences with RAIn, the system proved to be very capable. Inexpensive hardware supported processing more than 200,000 feeds per day. More expensive hardware, or a cluster of inexpensive hardware, should be capable of processing a significantly larger number of feeds. Despite claims that there may be more than 50 million blogs worldwide, it is likely that there are significantly fewer that are actively


    updated. A larger RAIn installation may be able to compete with sites like Technorati and

    Syndic8, monitoring a significant percentage of the active blogs in the world.

Building upon the statistics, more complex analyses are possible. RAIn could easily be used as the basis of a word monitoring system. Simple word-burst techniques could be applied to watch for sudden changes in a word, finding significant events as they are happening. In times of crisis, the Internet has proven to be the fastest source of news and information time and again. By actively monitoring RSS feeds, it may be possible to become aware of significant events before they hit the mainstream media.

RAIn also accumulates a substantial amount of content from blogs. On top of the statistical indexing methods currently in use, full-text indexing methods could be applied to create a blog search engine. Combined with existing search technologies like Lucene, Nutch or XTF, the content could be easily indexed and searched, providing a very substantial searchable archive of blogs.

RAIn proved to be very capable, monitoring a significant number of feeds on inexpensive hardware. More important, RAIn proved to be very flexible and adaptable. As configured, a large number of statistical analyses can be performed on RAIn's data. However, with RAIn's raw data, nearly any statistical analysis can be performed; one need only write the module to do it.


    References

[1] The Apache Software Foundation. Apache Lucene. Accessed May 10, 2005.

[2] Barr, Jeff, and Bill Kearney. Syndic8. Accessed May 10, 2005.

[3] Lemburg, Marc-André. mxDateTime: Date and Time Types for Python. Accessed May 10, 2005.

[4] Goodnough, Abby, and Maria Newman. "Supreme Court Rejects Request to Reinsert Feeding Tube." New York Times, March 24, 2005. Accessed May 10, 2005.

[5] Hastings, Kirk, and Martin Haye. XTF (eXtensible Text Framework). Accessed May 10, 2005.

[6] Jacobsen, Kjetil, and Markus Oberhumer. PycURL Home Page. April 6, 2005. Accessed May 10, 2005.

[7] Khare, Rohit, Doug Cutting, Kragen Sitaker, and Adam Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs Technical Report 04-04. Accessed May 10, 2005.

[8] Klam, Matthew. "Fear and Laptops on the Campaign Trail." New York Times, September 26, 2004. Accessed May 10, 2005.

[9] Kleinberg, Jon. "Bursty and Hierarchical Structure in Streams." Proceedings of the 8th ACM SIGKDD, July 2002.

[10] Libby, Dan. RSS 0.91 Spec, Revision 3. July 10, 1999. Accessed May 10, 2005.

[11] Mueller, Martin. The WordHoard Project. April 2005. Accessed May 10, 2005.

[12] Pilgrim, Mark. Universal Feed Parser. Accessed May 10, 2005.

[13] PostgreSQL Global Development Group. PostgreSQL: The World's Most Advanced Open Source Database. Accessed May 10, 2005.


[14] Riley, Duncan. "Number of Blogs Now Exceeds 50 Million Worldwide." The Blog Herald, April 14, 2005. Accessed May 10, 2005.

[15] Sifry, David, et al. Technorati. Accessed May 10, 2005.

[16] van Rossum, Guido, et al. Python Programming Language. Accessed May 10, 2005.

[17] Winer, Dave. RSS 2.0 Specification. January 30, 2005. Accessed May 10, 2005.

[18] Winstead, Jim, Jr. blo.gs. Accessed May 10, 2005.