Harvesting and semantically tagging media releases from political websites using web services

Department of Parliamentary Services Parliamentary Library Harvesting and semantically tagging media releases from political websites using web services Peter Neish, Systems Officer Victorian Parliamentary Library @peterneish

Upload: peter-neish

Post on 22-Nov-2014

DESCRIPTION

Presented at VALA2012 by Peter Neish on February 9, 2012. The talk describes how media releases were automatically harvested from political websites by polling the RSS feeds of relevant sites. Media releases were then semantically tagged using the OpenCalais web service.

TRANSCRIPT

Page 1: Harvesting and semantically tagging media releases from political websites using web services

Department of Parliamentary Services, Parliamentary Library

Harvesting and semantically tagging media releases from political websites using web services

Peter Neish, Systems Officer, Victorian Parliamentary Library

@peterneish

Page 2: Harvesting and semantically tagging media releases from political websites using web services

What will be covered

• Background – why are we interested in media releases

• What we did and how we went about it

– Part 1: Automatic Harvester

– Part 2: Semantic Tagging

• Lessons Learnt, where to from here – that kind of thing

Page 3: Harvesting and semantically tagging media releases from political websites using web services

About the library

• Established in 1851

• Clients:

– Members of Parliament and their staff

– Department of Parliamentary Services (especially committees)

– Academics and public

• Approx. 25 staff in client support, research, reference, tech services and e-services.

Page 4: Harvesting and semantically tagging media releases from political websites using web services

Media Releases

• Media releases play an important part in the political process

• Establish a party’s position on an issue at a particular point in time

• Often used in reference requests

• Library established a media release database in 1992

Page 5: Harvesting and semantically tagging media releases from political websites using web services

Number of Media Releases per year

[Chart: number of media releases per year, 1992 to 2010; vertical axis 0 to 7000]

Page 6: Harvesting and semantically tagging media releases from political websites using web services

Media releases by party

[Chart: media releases per year by party, 1993 to 2011; vertical axis 0 to 5000; series: ALP, Coalition, Green, Independent]

Page 7: Harvesting and semantically tagging media releases from political websites using web services

Project aims

• Automate the process of adding media releases to the database

• Combine our different databases together – better user experience

• Examine possibility of automatically applying tags to media releases using web services

Page 8: Harvesting and semantically tagging media releases from political websites using web services

Part 1: Automation

• Political parties have websites

• Websites have Content Management System (CMS)

• CMS has RSS feed

• RSS can be used as a standard input to software
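
The chain above ends with RSS as a machine-readable input. As a rough illustration only, the sketch below reads an RSS 2.0 feed and returns the items whose links have not been seen before. The project itself used Java and the Rome parser. Python's standard library is used here purely for brevity, and the function name `new_items`, the sample feed and the example.org URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen_links):
    """Return (title, link, pubDate) for items whose link is not in seen_links.

    Links of returned items are added to seen_links, so a later poll of the
    same feed does not return them again.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    # RSS 2.0 places each entry in an <item> element under <channel>.
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link or link in seen_links:
            continue
        title = (item.findtext("title") or "").strip()
        published = (item.findtext("pubDate") or "").strip()
        fresh.append((title, link, published))
        seen_links.add(link)
    return fresh


# Hypothetical feed content, standing in for a party website's RSS feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example party media releases</title>
<item><title>First release</title><link>http://example.org/media/1</link>
<pubDate>Thu, 09 Feb 2012 10:00:00 +1100</pubDate></item>
<item><title>Second release</title><link>http://example.org/media/2</link>
<pubDate>Thu, 09 Feb 2012 11:30:00 +1100</pubDate></item>
</channel></rss>"""

seen = {"http://example.org/media/1"}
for title, link, published in new_items(SAMPLE_FEED, seen):
    print(title, link, published)
```

Because the first link is already in `seen`, only the second item is reported. A real harvester would presumably fetch the feed over HTTP on a schedule and keep the set of seen links in persistent storage.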

Page 9: Harvesting and semantically tagging media releases from political websites using web services

What we built

[Diagram labels: Polls RSS feed for links; DB/Textworks; wkhtmltopdf; Servlet; Metadata]

Page 10: Harvesting and semantically tagging media releases from political websites using web services

Technologies used

• Java / Tomcat

• Rome RSS parser https://rome.dev.java.net/

• Wkhtmltopdf http://code.google.com/p/wkhtmltopdf/

• DB/Textworks

• MySQL (for semantic tags – later)
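
As a hedged illustration of how a harvested link might be turned into a PDF, the sketch below builds the basic wkhtmltopdf command line (input URL followed by output file) and runs it only if the tool is installed. The helper names `pdf_command` and `save_as_pdf`, the URL and the file name are hypothetical, and the slides do not say how the original Java servlet invoked the tool.

```python
import shutil
import subprocess


def pdf_command(url, output_path, binary="wkhtmltopdf"):
    """Build the argument list for: wkhtmltopdf <input URL> <output file>."""
    return [binary, url, output_path]


def save_as_pdf(url, output_path):
    """Render url to a PDF at output_path.

    Returns False without doing anything if wkhtmltopdf is not on the PATH;
    raises subprocess.CalledProcessError if the tool exits with an error.
    """
    if shutil.which("wkhtmltopdf") is None:
        return False
    subprocess.run(pdf_command(url, output_path), check=True)
    return True


# Show the command that would be run for a hypothetical media release.
print(pdf_command("http://example.org/media/2", "release-2.pdf"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when URLs contain characters such as `&` or `?`.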

Page 11: Harvesting and semantically tagging media releases from political websites using web services

Results

• It works – since July 2010 we’ve harvested 11,000 media releases

• Saved c. 2 days of staff time per week

Problems

• Non-standard content in feeds (e.g. dates)

– Addressed with Yahoo! Pipes

• Websites changing their structure or CMS
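
The project handled non-standard feed content with Yahoo! Pipes. Purely to illustrate the kind of clean-up involved, the sketch below normalises a few date spellings to ISO 8601 in code instead. The fallback formats are assumptions (including day-first ordering for slash-separated dates), not formats reported in the talk, and `normalise_date` is a hypothetical name.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Assumed fallback formats; a real list would come from inspecting each feed.
FALLBACK_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%d/%m/%Y", "%d %B %Y")


def normalise_date(raw):
    """Return the date as YYYY-MM-DD, or None if the text is not recognised."""
    text = raw.strip()
    # First try the RFC 822 style that RSS 2.0 specifies for <pubDate>.
    try:
        return parsedate_to_datetime(text).date().isoformat()
    except (TypeError, ValueError):
        pass  # Not an RFC 822 date; fall through to the other formats.
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for example in ("Thu, 09 Feb 2012 10:00:00 +1100", "9/2/2012", "9 February 2012"):
    print(example, "->", normalise_date(example))
```

Returning `None` for unrecognised text lets the caller flag the item for manual review rather than store a wrong date.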

Page 12: Harvesting and semantically tagging media releases from political websites using web services

Part 2: Semantic Tagging

• Increasing number of media releases meant that manual indexing was too time consuming

• Examined ways of automatically tagging media releases without human intervention

• Services examined:

– Alchemy API

– Evri

– OpenAmplify

– OpenCalais

– Yahoo Term Extractor

– Zemanta

Page 13: Harvesting and semantically tagging media releases from political websites using web services

OpenCalais

• Product of Thomson Reuters – focus is on news articles

• Good number of tags (not too low or high)

• Minimal false matches

• Good documentation and community

• Generous limits on API calls

However: it is a closed box (the algorithm is secret), and the company appears to have recently scaled back development

Page 14: Harvesting and semantically tagging media releases from political websites using web services

OpenCalais example

• http://viewer.opencalais.com/

Page 15: Harvesting and semantically tagging media releases from political websites using web services

Number of Tags assigned by OpenCalais

[Chart: total number (vertical axis, 0 to 4500) against tags per item (horizontal axis, 0 to 120)]

Page 16: Harvesting and semantically tagging media releases from political websites using web services

User interface

Page 17: Harvesting and semantically tagging media releases from political websites using web services

Tag Quality

[Chart: segments labelled Correct Tags, Incorrect Tags, Repeated Tags and Redundant Tags; values shown: 85%, 4%, 6%, 5%]
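
A breakdown like this one can be produced by having a reviewer label each automatically assigned tag and then computing each label's share. The sketch below shows that arithmetic under stated assumptions: the label names mirror the chart legend, the sample judgements are invented for illustration and are not the project's data, and `quality_shares` is a hypothetical name.

```python
from collections import Counter


def quality_shares(judgements):
    """Return each label's share of all judgements as a percentage (1 d.p.)."""
    counts = Counter(judgements)
    total = sum(counts.values())
    # An empty input yields an empty dict, so no division by zero occurs.
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Invented reviewer judgements for ten tags (not data from the talk).
sample = ["correct"] * 6 + ["incorrect"] + ["repeated"] * 2 + ["redundant"]
for label, share in quality_shares(sample).items():
    print(f"{label}: {share}%")
```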

Page 18: Harvesting and semantically tagging media releases from political websites using web services

Problems - disambiguation

Victoria, Australia

• Population: 5.5 million

• Area: 237,629 km²

• Google Hits: 48 million

Victoria, Seychelles

• Population: 25,000

• Area: 451 km²

• Google Hits: 2.5 million

Photo by http://www.flickr.com/photos/meckimac/2971992/

Photo by http://www.flickr.com/photos/eclogite/257560117/

Page 19: Harvesting and semantically tagging media releases from political websites using web services

Linked Data

• OpenCalais links to its own ontology (rich in data for companies, but other classes have limited data)

• SameAs or web links to:

– DBpedia

– Wikipedia

– Freebase

– Reuters.com

– GeoNames

– Shopping.com

– IMDB

– LinkedMDB
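
To show how such links might be handled once received, the sketch below groups a list of link URIs by the dataset they point to, using the host name. The host-to-dataset mapping and the sample URIs are assumptions chosen for illustration, and `dataset_for` is a hypothetical name. The slides do not show what the OpenCalais response looks like, so the sketch starts from a plain list of URIs.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed mapping from host-name suffix to the dataset named on the slide.
KNOWN_DATASETS = {
    "dbpedia.org": "DBpedia",
    "wikipedia.org": "Wikipedia",
    "freebase.com": "Freebase",
    "reuters.com": "Reuters.com",
    "geonames.org": "GeoNames",
    "shopping.com": "Shopping.com",
    "imdb.com": "IMDB",
    "linkedmdb.org": "LinkedMDB",
}


def dataset_for(uri):
    """Return the dataset name for a link URI, or "Other" if unrecognised."""
    host = (urlparse(uri).hostname or "").lower()
    for suffix, name in KNOWN_DATASETS.items():
        # Match the bare domain or any subdomain of it (e.g. en.wikipedia.org).
        if host == suffix or host.endswith("." + suffix):
            return name
    return "Other"


# Illustrative URIs only; the identifiers are not taken from the talk.
links = [
    "http://dbpedia.org/resource/Victoria_(Australia)",
    "http://en.wikipedia.org/wiki/Victoria_(Australia)",
    "http://sws.geonames.org/2145234/",
]
grouped = defaultdict(list)
for uri in links:
    grouped[dataset_for(uri)].append(uri)
for name, uris in grouped.items():
    print(name, uris)
```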

Page 20: Harvesting and semantically tagging media releases from political websites using web services

Conclusion

• Been able to save 2 days per week of staff time with automatic harvesting

• Media releases now available as they are released (no backlog)

• Using web services we have been able to tag content with meaningful semantic tags to enrich our data

• Can link our databases together using common tags and to other databases in the Linked Data ecosystem
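
The last point, linking databases through common tags, can be sketched as a simple index from tag to records. The version below is a minimal illustration, not the library's implementation: the database names, record identifiers and tags are invented, and the function names `index_by_tag` and `related` are hypothetical.

```python
from collections import defaultdict


def index_by_tag(records):
    """Map each tag (case-insensitively) to the (database, record_id) pairs using it."""
    index = defaultdict(set)
    for database, record_id, tags in records:
        for tag in tags:
            index[tag.casefold()].add((database, record_id))
    return index


def related(index, tag):
    """Return a sorted list of records, from any database, that carry the tag."""
    return sorted(index.get(tag.casefold(), ()))


# Invented records from two hypothetical databases.
records = [
    ("media_releases", "MR-1", ["Victoria", "Public transport"]),
    ("media_releases", "MR-2", ["Education"]),
    ("other_database", "X-7", ["public transport"]),
]
index = index_by_tag(records)
print(related(index, "Public Transport"))
```

Folding case before indexing is one simple way to stop "Public transport" and "public transport" being treated as different tags. Matching on a shared Linked Data URI rather than the tag text would be stricter, where such URIs are available.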