
http://dayta.me - A Personal News + Data Recommender for Your Day

Ali Al-Mahrubi, Gianluca Correndo, Augustinas Grieze, Stephen Lynch, Ian Millard, Emmanuel Munyadzwe, Tope Omitola, Igor Popov, Darren Richardson, Manuel Salvadores, Daniel A. Smith, Max Van Kleek, Yang Yang, mc schraefel, Prof. Wendy Hall, Prof. Tim Berners-Lee, and Prof. Nigel Shadbolt

EnAKTing, Electronics and Computer Science, University of Southampton, Southampton, UK.

http://dayta.me

Abstract. http://dayta.me is a highly personal information recommender that augments a person’s online calendar with useful information pertaining to their upcoming activities. To perform this recommendation, it queries a large collection of distributed Linked Data and Web 2.0 data sources live, and provides a clear, simple user interface through which everyday citizens can bring formerly inaccessible information, placed in the context of their day, within easy and convenient reach.

Keywords: information presentation, recommendation service, linked data

1 Introduction

The wide array of location-based services that have emerged in recent years has started to demonstrate the usefulness of delivering just-in-time information based upon the user’s current whereabouts and activity. But such emphasis on the user’s active context limits the potential utility of this information, as it often arrives too late for the user to make effective use of it. For example, it is of limited use for a person to be told about bad weather or traffic if they are already experiencing it first-hand, or about a colleague being unable to attend a meeting after it has already begun.

It would thus be considerably more useful if systems could recommend information in advance of when a user will likely need it. A core requirement for achieving such a system, of course, is a reliable source of short-term predictions about the user’s likely upcoming locations and activities. While automatic algorithms that can make such predictions are still not realizable, one often-overlooked source of information that is of predictive use for many people is already widely available: online personal calendars.

For the 2010 Semantic Web Challenge, we have designed dayta.me, a personal data recommender driven by a user’s personal calendar that blends (RDF) Linked Data [2] and Web 2.0 data sources to make aggressively personalized

2 Lecture Notes in Computer Science: EnAKTing Team

Fig. 1. dayta.me UI showing 3 columns: events, the main (focused) event, and recommended items. Recommended items are collated by type.

recommendations of web-based information items that the user will likely need during her day. These recommendations are based on the people she will meet, places she will visit, and the activities she will be engaged in, according to her current location and calendar. The recommended items are not restricted to traditional news and social media items such as news articles, status updates, and tweets; they can also include views of arbitrary RDF information, including visualisations of statistical data, both from real-time sources (such as weather and traffic) and from large archived statistical data sets, such as public sector data archived by governments (e.g., data.gov and data.gov.uk).

The foremost aim of this project is to provide regular citizens with a view of their daily calendars that is enriched with pertinent information selected from the huge quantities of Linked Data available today. This information should aim to help them maintain better situational awareness and be more prepared for upcoming events. A secondary goal is to demonstrate how Linked Data and semi-structured (Web 2.0) data sources can be used in tandem to solve a common task: aggressive personalised recommendation.

We begin with a walk-through of the system, and follow this with a detailed description of how the system works, focusing on the challenges of mixing Linked Data and other Web sources of knowledge during the recommendation process.


2 Scenario-based walk-through

In this section, we start with a quick walk-through scenario of how a user interacts with dayta.me.

Mario is a first-year postgraduate studying Web Science. Being new to the field, his supervisor suggests he attend a Web Science conference in London. While the conference site provides an abundance of information about the event and attendees, Mario currently lacks the time to look at it in depth, so he simply copies and pastes the event details into his Google Calendar.

Two days before the conference, Mario opens his dayta.me (Figure 1) and is reminded by the entry that he needs to finalize his travel plans, including finding a decent hotel. When he clicks on the event in his calendar, dayta.me suggests a variety of information pertaining to the event itself: tweets by people hosting the conference, news and blog articles about the conference, and recommended news articles about topics related to the conference.

In addition to these items, dayta.me also recommends a variety of statistics about the conference’s location, including crime rates in the area over recent years, and weather and temperature forecasts for the day in question. These are slices of government data sets (from data.gov(.uk)) and real-time data sources (for traffic, weather, etc., such as from http://data.london.gov.uk/apibeta) selected and rendered into easy-to-read summary charts. He is dismayed to discover that certain areas near the conference have particularly high crime (Figure 2b). As the conference ends late, he decides to look for hotels nearby so he will not have to walk far. Additionally, being new to the UK, he explores the possibility of doing some sightseeing after the conference. Selecting the map view (Figure 2a) reveals DBpedia points of interest within walking distance of the conference. He clicks on the National Gallery, which causes the interface to slide to the right, revealing the National Gallery’s Wikipedia article in the centre.

Fig. 2. (a) Map view for recommended items. (b) Example of statistical (SCOVO) interactive widget for public sector info, such as mortality rates, unemployment, etc.


Fig. 3. dayta.me architecture and data flow, described in Section 3

3 dayta.me architecture

dayta.me consists of two major components: the dayta interface (which runs in the browser), and its server-side support, which comprises the recommender engine, a triple store, and the new data gremlin. The way that these components interact to quickly and efficiently recommend items based on the user’s context and calendar items is illustrated in Figure 3 and described as follows:

1. Retrieve personal events and user context - When a user logs into dayta.me, dayta.me’s browser-side code first retrieves the user’s private calendars¹ and displays upcoming events in the left-hand column. At the same time, the browser also retrieves the user’s sensed location through the HTML5-standardized Geolocation API.

2. Event selection - When the user clicks on an event, that event and the user’s location are bundled up and transmitted to the server to serve as the basis for recommendation. Once on the server, this bundle is transformed into RDF using the semantic enrichment process described in Section 5 and kept in a private, user-specific graph that is discarded as soon as the transaction ends.

3. Recommendation - The core recommendation process is a simple rule-based recommender in which modules known as rules compute scores for each candidate data item during the voting process. Once computed, these scores are aggregated (summed) and used to rank the items. Items whose aggregate score exceeds a threshold are passed to the lens selection engine.

1 Currently this is done using the Google Calendar GDATA API, but will be expanded to other popular online calendars soon.


4. Lens selection - Since the best way to display a particular information item largely depends on what it represents, dayta.me uses lenses to flexibly define how particular information items should be visually presented in the interface. Lenses, similar to those of Fresnel, are RDF entities consisting of a pattern (which specifies the structure of the entities the lens is designed to render), an HTML template, associated CSS styling, and (optionally) JavaScript controller code. Lenses are matched with the data items to be presented by applying each lens’ pattern (expressed in SPARQL) to each data item in turn and determining whether the item meets the query specification. (When multiple lenses match an item, the most specific matching lens is chosen.)

5. Rendering - Upon receiving the relevant data items and their corresponding lenses, the browser instantiates the HTML template fragments of the lenses with their corresponding items to produce the interactive view of these recommended elements.

6. New item retrieval - New items are retrieved regularly by the new data gremlin, based upon source definitions stored in the triple store. Meta-data for the items retrieved is extracted as described in Section 5.
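
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not dayta.me's actual implementation: the two rule functions, the item fields, the threshold value, and the modelling of lens patterns as sets of required properties (standing in for SPARQL patterns) are all hypothetical, as is the choice to treat the lens with the most required properties as the "most specific" one.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Item = Dict[str, object]
Context = Dict[str, object]
Rule = Callable[[Item, Context], float]


def topic_rule(item: Item, ctx: Context) -> float:
    """Hypothetical rule: one vote per topic shared with the focused event."""
    return float(len(set(item.get("topics", [])) & set(ctx.get("topics", []))))


def place_rule(item: Item, ctx: Context) -> float:
    """Hypothetical rule: two votes if the item mentions the event's place."""
    place = ctx.get("place")
    return 2.0 if place is not None and item.get("place") == place else 0.0


def recommend(items: List[Item], ctx: Context, rules: List[Rule],
              threshold: float) -> List[Tuple[float, Item]]:
    """Sum every rule's score per item, keep items whose total exceeds
    the threshold, and rank them best first."""
    scored = [(sum(rule(item, ctx) for rule in rules), item) for item in items]
    kept = [(score, item) for score, item in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Assumption: a lens pattern is reduced to the set of properties an item
# must have. A real lens would carry a SPARQL pattern, an HTML template,
# CSS styling and optional controller code.
LENSES: Dict[str, Set[str]] = {
    "generic": {"title"},
    "article": {"title", "url"},
    "geo-article": {"title", "url", "place"},
}


def select_lens(item: Item, lenses: Dict[str, Set[str]]) -> Optional[str]:
    """Return the matching lens that requires the most properties, or None."""
    matches = [name for name, required in lenses.items()
               if required.issubset(item.keys())]
    return max(matches, key=lambda name: len(lenses[name]), default=None)


context = {"topics": ["web science", "linked data"], "place": "London"}
items = [
    {"title": "Linked data news", "url": "http://example.org/a",
     "topics": ["linked data"], "place": "London"},
    {"title": "Cooking tips", "url": "http://example.org/b", "topics": ["food"]},
    {"title": "Web science blog", "url": "http://example.org/c",
     "topics": ["web science"]},
]
ranked = recommend(items, context, [topic_rule, place_rule], threshold=0.5)
for score, item in ranked:
    print(score, item["title"], select_lens(item, LENSES))
```

Because each rule only returns a number, new rules can be added to the list without changing the aggregation or ranking code, which is one plausible reading of why a voting scheme suits a system with many heterogeneous sources.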

3.1 Data sources

dayta.me currently uses more than 30 data sources and services (of 17 different varieties), for three different purposes, as illustrated in Table 1. The first, and currently smallest, category consists of sources of user context: the user’s calendar (i.e., Google Calendar), the browser (for location), and the user herself (i.e., manually supplied events). The second category consists of sources of items to recommend, for which dayta.me uses the largest number of sources; this includes event/article feeds (RSS/ATOM) from both news outlets and blogs, Twitter tweet streams, Freebase and DBpedia for articles on people and entities mentioned, government PSI statistical data sets, NAPTAN, and OpenlyLocal. The final category pertains to data sources that are consulted as external knowledge bases by rules during the recommendation process to compute relevance.

Sources of user context  | Sources of items to recommend                        | Sources used during recommendation
Google Calendar (events) | RSS feeds (e.g. BBC, NYT, xkcd, ...)                 | OpenCalais (extraction)
Browser (geolocation)    | Twitter (tweets)                                     | Geonames (disambiguation)
Events form              | DBpedia                                              | EnAKTing Geoservice (containment/proximity resolution)
User clicks on items     | Freebase (biographic profiles)                       | Freebase (disambiguation)
                         | NAPTAN (rail, bus stops)                             | DBpedia (topic distance)
                         | EnAKTing statistics datasets (mortality, crime, ...) | http://mapit.mysociety.org
                         | TFL (traffic, tube status)                           | Ordnance Survey (postcode/disambiguation)
                         | Weather Service                                      |

Table 1. Data sources used by dayta.me, categorized by how they are used.


4 Related work

Among the three commonly known types of recommendation systems [1], dayta.me falls within the realm of knowledge-based recommendation systems, incorporating knowledge embodied in a wide array of knowledge bases, including DBpedia, various geographic knowledge bases, and specialised knowledge sources. The fact that most of dayta.me’s knowledge lives elsewhere allows it to be viewed as a decentralized recommender, similar to work by Heitmann [5]. Finally, since some of the features used for recommendation are user-contextual (e.g., location), dayta.me should be considered a contextual recommender [3].

5 Discussion and challenges

The biggest challenges in realizing dayta.me were compensating for the lack of suitable meta-data for use in recommendation, and reconciling reference ambiguity across heterogeneous sources. As described in Sec. 1, dayta.me’s rules require a variety of meta-data about both the basis of recommendations (e.g., events and user context) and the recommendation candidates themselves. While RDF provides an expressive way to represent this meta-data, few of the popular data sources available today offer such descriptions. (This is in part due to the widespread adoption of the simpler, limited RSS 0.95 and ATOM news syndication formats, over the richer RDF-based RSS 1.0.) Instead, the formats these sources provide offer minimal structure and little flexibility, and express values in unstructured text rather than unambiguous URIs, resulting in the need to explicitly obtain more information about these items.

Entity extraction. Thus, to cope with these formats and generate the meta-data needed for recommendation, dayta.me runs an information extraction process over the meta-data fields of retrieved items and, at times, over the entire items themselves. The extraction of these references is currently done using the OpenCalais named entity recognition APIs², which parse the text and produce a summary of all the relevant lexical terms, each with a type and an OpenCalais URI that identifies it.

Entity disambiguation. Unfortunately, the URIs provided by the OpenCalais API are not disambiguated, in the sense that they often represent terms rather than concepts. For example, the text “Oxford, United Kingdom” yields a URI³ that, when dereferenced, links to all cities known as “Oxford”, regardless of location or of the fact that the original text contained the terms “United Kingdom”. Therefore, the semantification process is only partial and, although the resource returned is typed as City, it is in fact an unknown city or a set of cities.

2 www.opencalais.com

3 Oxford: http://d.opencalais.com/genericHasher-1/3966df4f-3a0a-3de2-b179-48423c2d7c72


The geographical disambiguation process takes the City and Country labels and uses the Geonames REST services to discover geographical entities. The URI now provided represents the actual geographical entity with less ambiguity. There is obviously still uncertainty about the context of discourse: are we talking about a constituency in Oxford or the city of Oxford? Another source of error is that the named entity recognition process can retrieve more than one city term and one or more country terms. With no further knowledge, the only possible way to proceed is to check all the possible combinations. The worst case arises when a news item contains references to English or American cities and then references to both the UK and the US: there are in fact at least 35 cities in the US named “Oxford”, and one in the UK.
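
The exhaustive check over city and country terms described above can be illustrated with a short sketch. The in-memory GAZETTEER is a hypothetical stand-in for a remote lookup such as the Geonames REST services, and its entries and identifiers are invented for the example.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical stand-in for a gazetteer service: maps a
# (city label, country label) pair to identifiers of matching places.
GAZETTEER: Dict[Tuple[str, str], List[str]] = {
    ("Oxford", "United Kingdom"): ["place:oxford-uk"],
    ("Oxford", "United States"): ["place:oxford-ms", "place:oxford-oh"],
    ("Cambridge", "United Kingdom"): ["place:cambridge-uk"],
    ("Cambridge", "United States"): ["place:cambridge-ma"],
}


def candidate_places(cities: List[str],
                     countries: List[str]) -> Dict[Tuple[str, str], List[str]]:
    """Try every city/country combination and keep those that resolve."""
    resolved: Dict[Tuple[str, str], List[str]] = {}
    for city, country in product(cities, countries):
        matches = GAZETTEER.get((city, country), [])
        if matches:
            resolved[(city, country)] = matches
    return resolved


# One city term and one country term: a single lookup, a single candidate.
unambiguous = candidate_places(["Oxford"], ["United Kingdom"])

# Two city terms and two country terms: four lookups, five candidates.
ambiguous = candidate_places(["Oxford", "Cambridge"],
                             ["United Kingdom", "United States"])
```

With n city terms and m country terms the sketch performs n × m lookups, and every pair that resolves to several places multiplies the candidates further, which is why text mentioning both UK and US place names is the costly case.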

In recommending statistical data, we used APIs that could perform reverse geocoding (at least for the UK) and that provided references to the Ordnance Survey (OS henceforth) geographical backbone of URIs. Using the http://mapit.mysociety.org APIs we were able to retrieve OS URIs for the smaller entities and, with the support of the EnAKTing geoservice [4], query for all the container URIs that were used to describe statistical data in the EnAKTing data sets. Ultimately, once an entity was disambiguated, it was transformed into (WGS84) latitude and longitude coordinates. This representation served as the lingua franca for mapping from ambiguous entity references to specific URI spaces, such as those of the Ordnance Survey or Geonames. A similar process was followed for people, because the URIs returned by OpenCalais had few owl:sameAs equivalences for people as well. Since Freebase URIs are, by contrast, well-linked to many important hubs (e.g. DBpedia and the New York Times), we leveraged the Freebase API⁴ for person resolution.
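
To illustrate how coordinates can act as a common key between URI spaces, the sketch below maps a WGS84 point to the nearest entry of another URI space using the haversine distance. This is a simplification made for illustration: the system described here relies on reverse geocoding and containment queries rather than a nearest-neighbour search, and the URIs, the approximate coordinates, and the `max_km` cut-off are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in WGS84 degrees

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two WGS84 points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_uri(point: Coord, uri_space: Dict[str, Coord],
                max_km: float) -> Optional[str]:
    """Return the URI whose coordinates lie closest to `point`,
    or None if nothing lies within `max_km`."""
    if not uri_space:
        return None
    uri, coord = min(uri_space.items(),
                     key=lambda entry: haversine_km(point, entry[1]))
    return uri if haversine_km(point, coord) <= max_km else None


# Hypothetical URI space with approximate coordinates.
OTHER_SPACE: Dict[str, Coord] = {
    "http://example.org/id/oxford-uk": (51.752, -1.258),
    "http://example.org/id/oxford-mississippi": (34.366, -89.519),
}

match = nearest_uri((51.75, -1.26), OTHER_SPACE, max_km=5.0)
no_match = nearest_uri((48.86, 2.35), OTHER_SPACE, max_km=5.0)
```

The cut-off keeps a point far from every known entity from being silently mapped to the least-bad candidate.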

6 Acknowledgements

This work was supported by the EnAKTing project funded by the Engineering and Physical Sciences Research Council under contract EP/G008493/1.

References

1. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering pp. 734–749 (2005)

2. Berners-Lee, T.: Design issues: Linked data. http://www.w3.org/DesignIssues/LinkedData.html (2006)

3. Burke, R.: Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12(4), 331–370 (2002)

4. Correndo, G., Salvadores, M., Yang, Y., Gibbins, N., Shadbolt, N.: Geographical service: a compass for the web of data. In: WWW2010 Workshop: Linked Data on the Web (LDOW2010) (April 2010)

5. Heitmann, B., Hayes, C.: Using linked data to build open, collaborative recommender systems. In: AAAI Spring Symposium: Linked Data Meets Artificial Intelligence (2010)

4 http://wiki.freebase.com/wiki/API


7 Appendix

Main criteria | Does dayta.me meet it?

The app has to be an end-user app (i.e. an app that provides a practical value to general Web users).

Absolutely. dayta.me was designed for daily use by non-specialized Web users.

Info sources used should be under diverse ownership or control, should be heterogeneous (syntactically, structurally, and semantically), and contain substantial real world data.

Yes. As described in Section 3.1, dayta.me relies on 30+ data sources, many under different ownership. Some are linked-data sources while others are “Web 2.0” semistructured sources with APIs. All data sources are “real”, queried live and only cached for performance.

The meaning of data has to play a central role. “Meaning” must be represented using Semantic Web technologies.

Yes. As described in Sec. 5, the core recommendation process relies on semantics extracted from candidate items (i.e. topics, places, persons of relevance). Furthermore, the relevance between entities (topics, entities) is assessed using semantic distance.

Data must be manipulated/processed in interesting ways to derive useful info; this processing has to play a central role in achieving things that alternative technologies cannot do as well, or at all.

Yes. Semantic technologies are used to enable widely heterogeneous information to be leveraged towards generating highly personalized recommendations driven by extremely sparse data (e.g., a single user’s calendar events and location). This would be difficult with standard statistical (collaborative-filtering) approaches. (All pattern-directed processing of the system is done using SPARQL queries.)

Additional desirable features | How dayta.me meets it

The application provides an attractive and functional Web interface (for human users).

Yes. We followed a user-centered design process, working with expert UI evaluations to ensure dayta.me would be appropriate for regular end-users. Features such as the “why was this recommended” explanations were added to make the system more transparent and understandable.

The application should be scalable (in terms of the amount of data used and in terms of distributed components working together). Ideally, the application should use all the data that is currently published on the Semantic Web.

Yes. 1) dayta.me’s open architecture permits new data sources and info types to be integrated easily. E.g., the use of lenses allows views for new types to be seamlessly integrated. 2) dayta.me’s back-end supports scaling to a large quantity of data sources and users. The typical bottleneck, the triple store, is a high-performance 4store cluster, and the front-end recommendation servers scale horizontally.

Rigorous evaluations have taken place that demonstrate the benefits of semantic technologies, or validate the results obtained.

We are planning a deployment (to 100s of users) in October 2010.

Novelty, in applying semantic technology to a domain or task that has not been considered before.

While this clearly derives from earlier semantic recommenders, it is unique in several ways: 1) it relies on both Web 2.0 and linked data sources in multiple places, 2) it is driven by both user context and their calendar, 3) it generates customized, appropriate views of SCOVO data sets.

Functionality is different from or goes beyond pure info retrieval.

Yes. dayta.me extends into the space of dynamic proactive/just-in-time retrieval systems.

The application has clear commercial potential and/or large existing user base.

Yes, an extremely large potential user base: people who use online calendaring systems and news aggregation services (and wish to combine the two!).

Contextual info is used for ratings or rankings.

Yes, the user’s location and calendar.

Multimedia documents are used in some way.

Videos or music are regularly recommended in feeds.

There is a use of dynamic data (e.g. workflows), perhaps in combination with static info.

Dynamic and static data are both used throughout the system. For example, dynamic sources: user’s location, weather, traffic, tweet and news streams. Static data: the user’s calendar, data.gov.uk, NAPTAN statistics. See Sec. 3.1.

The results should be as accurate as possible (e.g. use a ranking of results according to context).

Accuracy is determined by the disambiguation process and rule performance, which, while not perfect, is acceptable for the app.
