
Harvesting digital newspapers at the Bibliothèque nationale de France

Géraldine Camile
Bibliothèque nationale de France

Tallinn, 2015-01-30

Summary

- Context and objectives of the “subscription-based press project”
- Harvesting news websites with robots
- Results and lessons learnt
- The future of the project – and its alternatives


Context and objectives of the “subscription-based press project”

Collecting digital news at the BnF

- Harvesting of news websites since 2010
  - Use of crawlers
  - 100 news websites harvested every day
  - Only freely accessible content
- Using robots to collect digital equivalents of newspapers
  - “Subscription-based” press project
  - Obtain passwords from publishers and crawl protected content
  - Focus on the PDF versions to ensure collection continuity, as microfilming budgets for local editions of regional newspapers are decreasing
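To make the idea of crawling one PDF per local edition more concrete, here is a minimal Python sketch. The slides do not describe how publishers' sites lay out their PDF editions or how the crawler logs in, so everything below is an assumption for illustration: the URL pattern, the edition codes, the function names `daily_pdf_seeds` and `authenticated_opener`, and the use of HTTP Basic authentication (many sites use form-based logins instead).

```python
from datetime import date
import urllib.request


def daily_pdf_seeds(base_url, editions, day):
    """Build one seed URL per local edition for a given publication day.

    The path layout (YYYY/MM/DD/<edition>.pdf) is hypothetical.
    """
    return [f"{base_url}/{day:%Y/%m/%d}/{edition}.pdf" for edition in editions]


def authenticated_opener(base_url, user, password):
    """Return a urllib opener that sends publisher-supplied credentials.

    Assumes HTTP Basic authentication, which is only one possible scheme.
    """
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, base_url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


# Placeholder publisher URL and edition codes, not real ones.
seeds = daily_pdf_seeds(
    "https://news.example.org/pdf",
    ["rennes", "nantes"],
    date(2015, 1, 30),
)
```

Generating the seed list per day and per edition keeps the crawl scope narrow: the crawler only needs the PDF of each edition, not the whole protected site.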


The subscription-based press project

- Various actors within the Library
  - Law, Economy and Politics department
  - Legal deposit department: printed periodicals service
  - Legal deposit department: digital legal deposit service
  - IT department
- Different skills and approaches for printed and digital periodicals
- Calendar
  - A one-year experiment
  - Started end 2012; assessment end 2013
  - Now in production mode


Harvesting news websites with robots

The harvesting workflow

Steps: Selection, Contact with publisher, Technical instruction, Web harvest, Quality assurance, Cataloguing, Description on access UI, Preservation

Staff involved: Curators, Library assistants, Cataloguers, Engineers


August 20th 2014, Harvesting digital newspapers at the BnF – Clément Oury – IFLA WLIC conference

Cataloguing…

Elements highlighted in the catalogue record:
- Format
- Link with the printed edition record
- Link to the archives
- Type: digital document
- Local editions

And access in the archives…


A guided tour of the news collection


Long term preservation in SPAR, BnF’s digital repository


Results and lessons learnt


The collections

22 titles                           192 local editions  Start of harvest
Ouest-France                        53                  July 19, 2012
Le Républicain lorrain              8                   December 12, 2012
Le Progrès                          18                  April 16, 2013
Midi libre                          14                  May 2, 2013
L’Indépendant                       3                   May 2, 2013
Centre Presse                       1                   May 2, 2013
La Tribune                          1                   May 22, 2013
Mediapart                           1                   July 16, 2013
La Montagne                         14                  October 10, 2013
Le Populaire du Centre              3                   October 10, 2013
La République du Centre             2                   October 10, 2013
Le Berry Républicain                1                   October 10, 2013
L’Écho Républicain                  1                   October 10, 2013
Le Journal du Centre                1                   October 10, 2013
Le Dauphiné libéré                  20                  April 7, 2014
Les Dernières Nouvelles d'Alsace    18                  April 7, 2014
L'Est Républicain                   10                  April 7, 2014
L'Alsace                            8                   April 7, 2014
Le Journal de Saône-et-Loire        7                   April 7, 2014
Le Bien Public                      4                   April 7, 2014
Vosges Matin                        2                   April 7, 2014

Harvested titles

[Figure: Map of the daily regional newspapers (no. 1, Oct./Nov. 2012, pp. 60-61). Labels visible on the map: Vosges Matin, La Liberté de l’Est]

Main achievements

- The collections!
- Technical experiments in harvesting protected content
- Creation of links between the General Catalogue and web archives
- Raising awareness among wider library staff about collecting digital publications
  - Even library assistants are now managing digital documents


The dark side of the crawl

- News websites’ architecture may change very quickly
  - Requires quick reactions and dedicated time from technical staff
- Non-harvested collections are difficult to recover
  - Press collections disappear very rapidly from the publisher’s website
- Some websites are technically NOT possible to harvest with crawling robots
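Because issues vanish quickly from publishers' sites, missed harvests need to be spotted early. The slides do not say how gaps are detected, so the following Python sketch is only one possible approach under stated assumptions: the function name `missing_issues`, the edition codes, and the simplification that every edition appears every day are all hypothetical.

```python
from datetime import date, timedelta


def missing_issues(expected_editions, harvested, start, end):
    """List (day, edition) pairs with no harvested PDF between start and end.

    `harvested` is a set of (day, edition) pairs already in the archive.
    Every edition is assumed to publish every day, which is a simplification.
    """
    gaps = []
    day = start
    while day <= end:
        for edition in expected_editions:
            if (day, edition) not in harvested:
                gaps.append((day, edition))
        day += timedelta(days=1)
    return gaps


# Placeholder archive contents: one edition is missing on the second day.
harvested = {
    (date(2014, 4, 7), "edition-a"),
    (date(2014, 4, 7), "edition-b"),
    (date(2014, 4, 8), "edition-a"),
}
gaps = missing_issues(
    ["edition-a", "edition-b"], harvested, date(2014, 4, 7), date(2014, 4, 8)
)
```

Running such a check daily would flag a gap while the missing issue may still be online, which matters given how fast press content is withdrawn.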


The future of the project – and its alternatives

The next steps of the project

- Extend the harvest to new titles
- Improve access to collections
  - A dedicated interface?
  - Full-text index of the press corpus?
- Promote the service to:
  - Librarians at reference desks
  - Researchers and other users
- Open remote access
  - From the researchers' desktops
  - From regional libraries entitled to receive access to web legal deposit collections

Success and alternatives

- Identify alternative ways of collection
  - Deposit from publishers through FTP?
  - Deposit from press aggregators?
  - Build upon the experience of the ebook deposit workflow
- A successful project… which needs to be complemented
