harvesting and semantically tagging media releases from political websites using web services
DESCRIPTION
Presented at VALA2012 by Peter Neish on February 9 2012 describing how media releases were automatically harvested from political websites by polling the RSS feeds or relevant sites. Media releases were semantically tagged using the OpenCalais web service.TRANSCRIPT
Department of Parliamentary ServicesParliamentary Library
Harvesting and semantically tagging media releases from political websites
using web services
Peter Neish, Systems OfficerVictorian Parliamentary Library
@peterneish
Department of Parliamentary ServicesParliamentary Library
What will be covered
• Background – why are we interested in media releases
• What we did and how went about it
– Part 1: Automatic Harvester
– Part 2: Semantic Tagging
• Lessons Learnt, where to from here – that kind of thing
Department of Parliamentary ServicesParliamentary Library
About the library
• Established in 1851
• Clients:
– Members of Parliament and their staff
– Department of Parliamentary Services (especially committees)
– Academics and public
• Approx. 25 staff in client support, research, reference, tech services and e-services.
Department of Parliamentary ServicesParliamentary Library
Media Releases
• Media releases play an important part in the political process
• Establish a party’s position on an issue at a particular point in time
• Often used in reference requests
• Library established a media release database in 1992
Department of Parliamentary ServicesParliamentary Library
Number of Media Releases per year
0
1000
2000
3000
4000
5000
6000
7000
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
Department of Parliamentary ServicesParliamentary Library
Media releases by party
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
ALP
Coalition
Green
Independent
Department of Parliamentary ServicesParliamentary Library
Project aims• Automate the process of
adding media releases to the database
• Combine our different databases together – better user experience
• Examine possibility of automatically applying tags to media releases using web services
Department of Parliamentary ServicesParliamentary Library
Part 1: Automation
• Political parties have website
• Websites have Content Management System (CMS)
• CMS has RSS feed
• RSS can be used as a standard input to software
Department of Parliamentary ServicesParliamentary Library
What we built
Polls RSS feedfor links
DBTextworks
wkhtml2pdf
Servlet
Metadata
Department of Parliamentary ServicesParliamentary Library
Technologies used
• Java / Tomcat
• Rome RSS parser https://rome.dev.java.net/
• Wkhtmltopdf http://code.google.com/p/wkhtmltopdf/
• DB/Textworks
• MySQL (for semantic tags – later)
Department of Parliamentary ServicesParliamentary Library
Results
• It works – since July 2010 we’ve harvested 11,000 media releases
• Saved c. 2 days of staff time per week
Problems
• Non-standard content in feeds (e.g. dates)
– Addressed with Yahoo! Pipes
• Website’s changing their structure or CMS
Department of Parliamentary ServicesParliamentary Library
Part 2: Semantic Tagging
• Increasing number of media releases meant that manual indexing was too time consuming
• Examined ways of automatically tagging media releases without human intervention
• Services examined:
– Alchemy API– Evri– OpenAmplify– OpenCalais– Yahoo Term Extractor– Zemanta
Department of Parliamentary ServicesParliamentary Library
Open Calais
• Product of Thomson Reuters – focus is on news articles
• Good number of tags (not too low or high)
• Minimal false matches
• Good documentation and community
• Generous limits on API calls
However: closed box (algorithm secret), recently company appear to have scaled back development
Department of Parliamentary ServicesParliamentary Library
Example Open Calais
• http://viewer.opencalais.com/
Department of Parliamentary ServicesParliamentary Library
Number of Tags assigned by OpenCalais
0
500
1000
1500
2000
2500
3000
3500
4000
4500
0 20 40 60 80 100 120
Tags per item
To
tal n
um
be
r
Department of Parliamentary ServicesParliamentary Library
User interface
Department of Parliamentary ServicesParliamentary Library
85%
4%
6%5%
Correct Tags
Incorrect Tags
Repeated Tags
Redundant Tags
Tag Quality
Department of Parliamentary ServicesParliamentary Library
Problems - disambiguation
Victoria, Australia
• Population: 5.5 Million
• Area: 237,629 km2
• Google Hits: 48 million
Victoria, Seychelles
• Population: 25,000
• Area: 451 km2
• Google Hits: 2.5 million
Photo by http://www.flickr.com/photos/meckimac/2971992/ Photo by http://www.flickr.com/photos/eclogite/257560117/
Department of Parliamentary ServicesParliamentary Library
Linked Data• OpenCalais links to its own ontology (rich in data for companies,
but other classes have limited data)
• SameAs or web links to:
– DBpedia
– Wikipedia
– Freebase
– Reuters.com
– GeoNames
– Shopping.com
– IMDB
– LinkedMDB
Department of Parliamentary ServicesParliamentary Library
Conclusion
• Been able to save 2 days per week of staff time with automatic harvesting
• Media releases now available as they are released (no backlog)
• Using web services we have been able to tag content with meaningful semantic tags to enrich our data
• Can link our databases together using common tags and to other databases in the Linked Data ecosystem