News data at the British Library
Luke McKernan
Lead Curator, News and Moving Image
Working with news data across different media
7 September 2015
www.bl.uk 2
Map of news stories in the UK as read via Twitter (created using bit.ly links), Guardian Datablog, 16 May 2012
Changing news
Moving from a world-class newspaper service to a world-class news service
Newspapers, television, radio and Web news
This reflects the significant changes in news production and consumption taking place today, and also how news has always been consumed
News does not exist in any one form. It is sought out and selected by its users, from the multiple forms of information on offer
Changing how we manage news data is an essential part of delivering this change
“News is information of current interest for a specific audience”
News content strategy
The Newcastle Courant, The Huffington Post, Today, Al Jazeera English
Newspapers
The UK national collection
34,000 newspaper titles: approximately 60M issues or 450M individual pages, from 17thC to present day
Current acquisition: 1,500 daily or weekly titles
Print copies acquired under legal deposit but will move increasingly towards digital acquisition
Physical access at Newsroom and Boston Spa
Online access to 11M pages via British Newspaper Archive (http://www.britishnewspaperarchive.com)
Approximately a third of the collection has microfilm access copies; around 2.5% has been digitised so far
British Newspaper Archive
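The collection figures above can be cross-checked with a little arithmetic. The sketch below uses only the approximate numbers quoted on this slide; the variable and function names are illustrative assumptions, not part of any British Library system.

```python
# Approximate figures quoted on the slide.
TOTAL_PAGES = 450_000_000   # individual newspaper pages
TOTAL_ISSUES = 60_000_000   # newspaper issues
DIGITISED_SHARE = 0.025     # "around 2.5% has been digitised"
ONLINE_PAGES = 11_000_000   # pages online via British Newspaper Archive


def digitised_pages(total_pages: int, share: float) -> float:
    """Estimate the number of digitised pages from the quoted share."""
    return total_pages * share


def pages_per_issue(total_pages: int, total_issues: int) -> float:
    """Average number of pages per issue implied by the totals."""
    return total_pages / total_issues


estimate = digitised_pages(TOTAL_PAGES, DIGITISED_SHARE)
average = pages_per_issue(TOTAL_PAGES, TOTAL_ISSUES)

print(f"Estimated digitised pages: {estimate:,.0f}")
print(f"Pages quoted as online:    {ONLINE_PAGES:,}")
print(f"Average pages per issue:   {average}")
```

The estimate of roughly 11.25M digitised pages is consistent with the 11M pages quoted as available online, given that both source figures are rounded.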
Television and radio news
Began recording television and radio news programmes receivable in the UK in May 2010
Collection of over 60,000 programmes, recorded off-air from 20 channels inc. BBC, Al-Jazeera, Russia Today, CNN, CCTV (China), NHK, Bloomberg, France 24, World Service, LBC
30 hours of TV and 22 hours of radio captured per day
Born digital archive, including Electronic Programme Guide data and subtitles where available
Access onsite only, owing to copyright restrictions, via Broadcast News service
Broadcast News
Web news
Non-print legal deposit legislation introduced in April 2013 means British Library can start harvesting UK websites
First annual crawl collected 4.5M .uk websites and web pages – collection now amounts to around 3Bn digital assets
Harvesting c.1000 UK news websites (newspapers and web-only sites e.g. hyperlocals) on daily/weekly basis, from end of 2013, with another 500 to be added soon
Access onsite only at British Library and other Legal Deposit libraries
Also Open UK Web Archive, smaller collection of selected websites, openly available at http://www.webarchive.org.uk
UK Web Archive
Our news research services
Explore.bl.uk, The Newsroom, Boston Spa reading room
British Newspaper Archive, UK Web Archive, Broadcast News
News data
2M 19thC British newspaper pages – XML, images
UK television news data 2010 onwards – EPG data for 45,000 programmes, subtitles (XML) for c.25,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
UK radio news data 2010 onwards – EPG data for 15,000 programmes, some speech-to-text files for 2011 broadcasts (XML)
Financial Times – four years of content (1888, 1939, 1966, 1991) – XML, images
Web news selection – possibly
Financial Times, 1893 and 2008
Plans
All out-of-copyright UK newspapers on British Newspaper Archive, with issue-level data for research re-use, covered by a single agreement and available through an API. Possibly…
Title-level data for all newspapers we hold (34,000 titles) released as open data
More partner initiatives
Hackathon on 16 November 2015, to be followed by other news data events in 2016
User-led development
BBC radio news script, 14/7/1969
Dreams
An open news dataset
An archive news data model
All British Library news records available at issue level
Hyperlocal news sites: On the Wight, The City Talking, A Little Bit of Stone
Questions
Copyright constraints limit use of much material to BL premises – how can tools such as named entity extraction work as a means to get round this?
How can print, web, television, radio news, and other news media, be linked up together, and to other resources, and how would this benefit research?
What research questions will we be able to support through a greater focus on news data?
Is news data only for the specialist, or can more general user-friendly applications be produced?
What can news archives learn from the management tools for current news?
How can we help each other?
TV news idents
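To make the first question above more concrete, here is a minimal sketch of what extracting named entities from news text could look like, so that only derived data (names and counts) rather than the copyrighted text itself needs to be shared. It is a deliberately naive stand-in, assuming that runs of capitalised words approximate named entities; a real service would use a trained named entity recogniser. The sample text, the `LEADING_WORDS` list and the function name are all hypothetical.

```python
import re
from collections import Counter

# Naive assumption: a run of capitalised words is a candidate entity.
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# Hypothetical list of common words to trim from the start of a run.
LEADING_WORDS = {"The", "A", "An", "In", "On", "At"}


def extract_candidates(text: str) -> Counter:
    """Count candidate entities (capitalised word runs) in the text."""
    counts = Counter()
    for match in CAPITALISED_RUN.finditer(text):
        words = match.group().split()
        # Drop leading articles and prepositions such as "The".
        while words and words[0] in LEADING_WORDS:
            words.pop(0)
        if words:
            counts[" ".join(words)] += 1
    return counts


# Invented sample text, not taken from any real news source.
sample = (
    "The Prime Minister spoke in Newcastle. "
    "Reporters from Newcastle asked the Prime Minister about Boston Spa."
)

counts = extract_candidates(sample)
for name, count in counts.most_common():
    print(name, count)
```

This approach has obvious limits: sentence-initial words such as "Reporters" are counted as false positives, and acronyms written in all capitals are missed entirely. It only illustrates the idea that a list of entities and counts can be published without reproducing the source text.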
Email: [email protected]
Twitter: @BL_newsroom
Web: http://bl.uk/subjects/news-media
Blog: http://britishlibrary.typepad.co.uk/thenewsroom
Contact