1
European Union
Digital Libraries on the Grid to Preserve Cultural Heritage
A use case: Federico De Roberto manuscripts
Leandro Ciuffo, on behalf of Dr. Antonio Calanducci ([email protected]), Istituto Nazionale di Fisica Nucleare, Catania
2
Federico De Roberto cultural heritage
• Federico De Roberto, an Italian writer of the 19th and 20th centuries, was born in Naples but spent his life in Catania; he left numerous works to the humanities community
• The collection consists of valuable and hard-to-manage items: manuscripts, typescripts, drafts with handwritten corrections, magazines, clippings, sketches, and photos
3
De Roberto literary collection: digitization
• Digitization of manuscripts, typescripts, and printed works
– TIFF files, one per page, 600 dpi, about 100 MB for an A3 page: high-resolution scans for in-depth examination
– Multipage PDF, one per work, 300 dpi, file sizes from 40 to 400 MB: for an overall examination of the works
– 8000 scans, 2 terabytes of disk space
– Different physical formats: A3, A4, custom sizes
4
De Roberto literary collection: metadata
• Embedded metadata
– TIFF files carry embedded metadata that describe the physical features of the scan and give information about the content:
  ImageWidth, ImageHeight, XResolution, FileSize, CreationDate, ModifyDate
  Description, Keywords, CaptionWriter, Title, Author, Copyright Status, Copyright Notice
– Added with Photoshop after the digitization phase (Adobe XMP format)
5
Goals and requirements
• Make these works accessible to the humanities research communities
• Find the desired document immediately
– Documents organized according to their physical and semantic metadata: by type, by category
– Dynamic filtering of the search result set by selecting one or more document metadata values
• Long-term preservation (digital preservation)
– Multiple copies (replicas) spread across different geographical sites
– Reliable storage systems and replica redundancy to achieve secure preservation
6
Data Management in Grid
• Storage Element (SE): a front-end server that aggregates a pool of hard disks and presents them as one big (virtual) disk
– “container” of users’ files
– generally one SE per site
– mirrored disks to avoid data loss in case of hardware failures
– fine-grained file permissions: owner, group, and given lists of users and groups (Access Control Lists, ACLs)
– keeps the mapping between each file and the physical disk of the pool
• File Catalogue: provides a single virtual file system across several Storage Elements
– keeps track of which SE (or SEs) contains a given file
– keeps track of replicas
– maps each file to its Storage Element file name
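The split of responsibilities between Storage Elements and the File Catalogue can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the API of any Grid middleware: the class `FileCatalogue`, its methods, the logical path, and the host names under `example.org` are all hypothetical.

```python
from collections import defaultdict


class FileCatalogue:
    """Toy catalogue mapping a logical file name to its replicas.

    Hypothetical illustration only; it does not reproduce any real Grid API.
    """

    def __init__(self):
        # logical file name -> {storage element host: file name on that SE}
        self._replicas = defaultdict(dict)

    def register(self, logical_name, se_host, se_file_name):
        """Record that `se_host` holds a copy of `logical_name`."""
        self._replicas[logical_name][se_host] = se_file_name

    def storage_elements(self, logical_name):
        """Return the hosts of all SEs that hold a replica, sorted by name."""
        return sorted(self._replicas.get(logical_name, {}))

    def replicas(self, logical_name):
        """Return a copy of the SE host -> SE file name mapping."""
        return dict(self._replicas.get(logical_name, {}))


catalogue = FileCatalogue()
# Two replicas of the same scan on two made-up sites.
catalogue.register("/deroberto/scans/page-001.tif", "se-a.example.org", "/pool1/f0001")
catalogue.register("/deroberto/scans/page-001.tif", "se-b.example.org", "/pool7/f9321")

print(catalogue.storage_elements("/deroberto/scans/page-001.tif"))
```

The user only ever handles the logical name; which SE serves the file, and under which internal name, stays a catalogue detail. That is what makes it possible to add replicas without changing how a document is referenced.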
7
Data Management in Grid
• Metadata Catalogue: stores and organizes the metadata of files saved on Storage Elements and registered in the File Catalogue
– metadata are organized by “collection” (a sort of directory); each collection has its own schema, a set of defined attributes
– e.g.: /deroberto/scans/manuscripts
  o Title: “La lupa”
  o Author: “Federico De Roberto, Giovanni Verga”
  o Genre: “Tragedia Lirica”
  o Pages: 34
  o FileType: TIFF
  o surl: srm://infn-se-01.ct.pi2s2.it/dpm/ct.pi2s2.it/home/cometa/generated/2008-06-14/filede4d6266-56c4-4d66-95b6-3d69063ef081
– answers users’ queries against the metadata describing the files, to find their physical location for later retrieval
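A minimal sketch of the collection idea, assuming a plain in-memory store: each collection has a schema, entries must respect it, and a query on attribute values returns the physical location (`surl`) of the matching files. The class `Collection` and its methods are hypothetical and are not the interface of any real metadata catalogue; the sample entry is the one shown on the slide.

```python
class Collection:
    """Toy metadata collection with a fixed schema (hypothetical illustration)."""

    def __init__(self, path, schema):
        self.path = path
        self.schema = set(schema)
        self.entries = []

    def add(self, **attributes):
        """Store one entry, rejecting attributes that are not in the schema."""
        unknown = set(attributes) - self.schema
        if unknown:
            raise ValueError(f"attributes not in schema: {sorted(unknown)}")
        self.entries.append(attributes)

    def query(self, **conditions):
        """Return the entries whose attributes equal all the given conditions."""
        return [
            entry
            for entry in self.entries
            if all(entry.get(name) == value for name, value in conditions.items())
        ]


manuscripts = Collection(
    "/deroberto/scans/manuscripts",
    ["Title", "Author", "Genre", "Pages", "FileType", "surl"],
)
manuscripts.add(
    Title="La lupa",
    Author="Federico De Roberto, Giovanni Verga",
    Genre="Tragedia Lirica",
    Pages=34,
    FileType="TIFF",
    surl="srm://infn-se-01.ct.pi2s2.it/dpm/ct.pi2s2.it/home/cometa/generated/"
         "2008-06-14/filede4d6266-56c4-4d66-95b6-3d69063ef081",
)

# Find where the TIFF scans of "La lupa" are physically stored.
locations = [entry["surl"] for entry in manuscripts.query(Title="La lupa", FileType="TIFF")]
print(locations)
```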
8
The Sicilian Grid COMETA
9
Current deployment (COMETA Grid)
300+ TBytes
International Workshop on Cyberinfrastructure and Archeology, San Miniato (PI), 16th-17th Oct 08
10
The gLibrary project
• Challenge:
– to offer an intuitive, flexible, secure, and multiplatform system to handle digital libraries on a Grid infrastructure
• Digital assets (the items handled in a digital library):
– any kind of content and/or media represented as a digital file, e.g.:
  Images (photos, scans, screenshots, logos, ...)
  Audio (songs, soundtracks, ringtones, ...)
  Video (movies, trailers, mobile phone videos, ...)
  Presentations, letters, reports, invoices, receipts
  E-books, e-mails, papers, magazines, etc.
• gLibrary makes it possible to store, organize, search, and retrieve digital assets in a Grid environment
11
gLibrary features
• Intuitive front-end implemented as a web application:
– accessible from anywhere; it needs only Internet access
– usable from any web browser (Internet Explorer, Mozilla Firefox, Opera, Safari) on any operating system (Windows, Linux, Mac OS X), hence multiplatform
– requires a Java Virtual Machine (available on any OS)
– extensive use of AJAX (Asynchronous JavaScript and XML) makes the web application dynamic and interactive, providing a desktop-like user experience
12
Assets organization
• “Types” and “Categories” are defined by the repository providers
• Assets are organized by type:
– each type is a list of specific attributes that describe one kind of asset managed by the system
– types are hierarchical (a child type shares and extends its parent’s attributes)
– the attributes are queried during searches
• and/or organized by category:
– categories group together related assets of different types
– they are also useful to define subsets of assets belonging to the same type
– multiple categories can be assigned to one asset (tagging)
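The type hierarchy and the tagging by category can be sketched as follows. This is a hedged illustration, not gLibrary's data model: the classes `AssetType` and `Asset`, the type names "Document" and "Scan", and the category names are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AssetType:
    """A type is a named list of attributes; a child type extends its parent."""

    name: str
    own_attributes: List[str]
    parent: Optional["AssetType"] = None

    def attributes(self) -> List[str]:
        # Inherited attributes come first, then the ones this type adds.
        inherited = self.parent.attributes() if self.parent else []
        return inherited + [a for a in self.own_attributes if a not in inherited]


@dataclass
class Asset:
    """An asset has exactly one type and any number of categories (tags)."""

    asset_type: AssetType
    values: dict
    categories: Set[str] = field(default_factory=set)


# Hypothetical hierarchy: every Scan is also described as a Document.
document = AssetType("Document", ["Title", "Author"])
scan = AssetType("Scan", ["Resolution", "ScanQuality"], parent=document)

page = Asset(scan, {"Title": "la lupa", "Resolution": 300})
page.categories.update({"manuscripts", "theatre"})

print(scan.attributes())
```

Keeping the type single and the categories multiple mirrors the slide: the type fixes which attributes exist, while categories are free-form groupings that can cut across types.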
13
Intuitive and instant search
• Assets are browsed by selecting a type (or category) and then one or more filters:
– filters are attributes of the selected type, chosen from a defined list and used to narrow the result set
• Filter application is cascading and context-sensitive: selecting a filter value dynamically influences the values offered by the subsequent filters (“à la iTunes” browsing)
– classical search by description and keywords is available too
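Cascading, context-sensitive filtering can be sketched in a few lines: after each selection, the values offered by the remaining filters are recomputed from the assets that still match. The function names and the sample assets below are hypothetical placeholders, not data from the repository.

```python
def apply_filters(assets, selections):
    """Keep only the assets that match every selected filter value."""
    return [
        asset
        for asset in assets
        if all(asset.get(name) == value for name, value in selections.items())
    ]


def available_values(assets, selections, attribute):
    """Values still selectable for `attribute`, given the current selections."""
    remaining = apply_filters(assets, selections)
    return sorted({asset[attribute] for asset in remaining if attribute in asset})


# Made-up assets, used only to show the cascading behaviour.
assets = [
    {"Title": "work A", "DocumentGenre": "novel", "FileType": "TIFF"},
    {"Title": "work B", "DocumentGenre": "novel", "FileType": "PDF"},
    {"Title": "work C", "DocumentGenre": "letter", "FileType": "TIFF"},
]

# With no selection, both file types are offered.
print(available_values(assets, {}, "FileType"))
# After choosing a genre, the FileType filter shrinks to what still applies.
print(available_values(assets, {"DocumentGenre": "letter"}, "FileType"))
```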
14
Details of the selected asset
15
Assets storing and retrieval
• Users can upload their local assets to one or more Storage Elements of the Grid (more than one creates replicas)
– uploads are managed through Java applets
– files already on an SE can be included in a digital library through the File Catalogue browser
• Download from SEs to the user’s laptop/desktop:
– select a replica link from a list
– download through a Java applet
16
Security and user management
• Being a grid application, gLibrary inherits all the security features of the underlying technologies:
– X.509 digital certificate authentication
– transfers based on proxy authorization
– VOMS (Virtual Organization Membership Service) used to distinguish users and assign the right permissions
• Three kinds of user role for each deployed digital library:
– gLibraryManager: defines the hierarchies of types and categories (with their attributes) and filters; grants submission rights to generic users
– gLibrarySubmitter: uploads new assets and defines permissions on their entries (fine-grained rights assignment)
– generic users: can search and download (the assets they have rights to)
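The three roles can be read as a role-to-actions table. The sketch below is only an assumption-laden illustration: the action names are invented, roles are plain strings rather than attributes obtained from VOMS, and a user is assumed to hold a set of roles because the slides do not say whether a manager or submitter automatically inherits the generic rights.

```python
# Role names come from the slide; the action names are hypothetical.
PERMISSIONS = {
    "gLibraryManager": {
        "define_types",
        "define_categories",
        "define_filters",
        "grant_submission_rights",
    },
    "gLibrarySubmitter": {"upload_asset", "set_asset_permissions"},
    "generic": {"search", "download"},
}


def is_allowed(roles, action):
    """Return True if at least one of the user's roles permits the action."""
    return any(action in PERMISSIONS.get(role, set()) for role in roles)


# A submitter who is also a generic user can upload and search.
submitter_roles = {"generic", "gLibrarySubmitter"}
print(is_allowed(submitter_roles, "upload_asset"))
print(is_allowed({"generic"}, "upload_asset"))
```

Per-asset rights (who may see or download a given entry) would be a second check on top of this table; they are not modelled here.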
17
gLibrary architecture
[Architecture diagram. Components: User, Login applet, Upload/Download applet, VOMS Server, AMGA Metadata Catalogue, LFC File Catalogue, Storage Elements (SE)]
Steps shown in the diagram:
1. local proxy creation
2. proxy transfer over HTTPS
3. get role
4. find the right asset
5. proxy retrieved over HTTPS
6. direct transfer from SE
18
Possible usage scenarios
• Suitable for communities that need to share large amounts of digital resources in an easy and secure way
• Some examples:
– “consumer” users: sharing of photos, music, movies, documents, office files, etc.
– enterprise/industrial/research communities: presentations, invoices, layouts, sounds, scans, manuscripts :)
• Each community defines how to describe its content (and how to search for it) and sets permissions to grant or deny access to specific users, groups, and whole organizations, exploiting the huge storage capacity, organization, and security features offered by a Grid infrastructure
• A use case: “De Roberto Digital Repository”
19
• Goals:
– store the 8000 scans of the De Roberto heritage ---> Grid Storage Elements
– enable ubiquitous, round-the-clock access for scientists ---> web application
– organize documents for fast search ---> metadata services
– long-term digital preservation of data ---> redundancy through replicas of files on several Storage Elements
– easy-to-use interface for searching, organizing, uploading, and downloading digitized documents
20
Metadata used in the DR digital library
• Types definition for the assets of the DR library
• Attributes definition per type, e.g.:

  Attribute        Value
  Title            la lupa
  Author           federico de roberto, giovanni verga
  Description      manoscritto della tragedia lirica …
  Keywords         verismo, federico de roberto, la lupa, …
  CaptionWriter    stefania iannizzotto, alessandro …
  CopyrightStatus  copyrighted
  PageNum          5
  TotalPages       34
  DocumentGenre    tragedia lirica
  PublicationYear  1916
  Publisher        officine tipo-litografiche barravecchia e balestrini
  FileType         PDF
  Resolution       300
  ScanQuality      good

• Filter definition per type, e.g.:
– DocumentGenre
– Title
– FileType
– ScanQuality
– DocumentType
– PublicationYear
– PublicationStatus
– Publisher
– Location
21
Browsing and filtering screenshot
22
Downloading
23
Download completed
24
Upload
25
Automatic metadata extraction
• Some libraries allow automatic metadata extraction from given file types:
– exiftool
– Imagero
• Both were able to read the XMP metadata, e.g.:
  $ exiftool -E -XMP:Subject -XMP:Description -XMP:Rights -XMP:Title -XMP:Author -FileName -FileSize 001\ gli\ illustri\ amanti.tif
  Subject : federico de roberto, manoscritti letterari, verismo, gli illustri amanti, la.mu.s.a., facoltà di lettere e filosofia catania, società di storia patria per la sicilia orientale
  Description : manoscritto de gli illustri amanti, conservato presso la biblioteca della società di storia patria per la sicilia orientale
  Rights : società di storia patria per la sicilia orientale catania. la.mu.s.a., facoltà di lettere e filosofia, università degli studi di catania
  Title : gli illustri amanti
  File Name : 001 gli illustri amanti.tif
  File Size : 106 MB
• We are working to integrate these libraries to speed up the acquisition stage
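One way such output could feed the acquisition stage is to turn the "Tag : Value" lines into a dictionary ready for the metadata catalogue. The sketch below is an assumption about how the integration might look, not the project's implementation; the function name `parse_exiftool_output` is hypothetical, and the sample text reuses three lines of the output shown above.

```python
def parse_exiftool_output(text):
    """Parse exiftool's default "Tag : Value" lines into a dictionary."""
    metadata = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values may contain colons.
        tag, _, value = line.partition(":")
        metadata[tag.strip()] = value.strip()
    return metadata


SAMPLE_OUTPUT = """\
Title                           : gli illustri amanti
File Name                       : 001 gli illustri amanti.tif
File Size                       : 106 MB
"""

metadata = parse_exiftool_output(SAMPLE_OUTPUT)
print(metadata["Title"])
```

exiftool can also emit JSON with its `-j` option, which would avoid parsing free text altogether; the line-based version is shown because it matches the output on the slide.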
26
Summary
• The gLibrary challenge is to offer a flexible, multiplatform, secure, and easy-to-use system to handle digital libraries on the Grid
– flexible: it handles any kind of asset, as defined by the library administrator
– multiplatform: implemented as a web application with Java applets, it can be accessed from any OS
– secure: fine-grained permissions (based on Grid certificates) can be set on assets
– easy-to-use: its intuitive interface, with an “à la iTunes” browser, makes it possible to find the desired asset with just a few mouse clicks
• A prototype of the De Roberto Digital Repository was implemented with gLibrary in a few weeks. It will enable scientists to access these works from anywhere and at any time in a simple and smart way, and it will allow the long-term preservation of this cultural heritage
27
References
• Contact: [email protected], [email protected]
• Prototype of the De Roberto Digital Repository:
– https://glibrary.ct.infn.it/deroberto/
• gLibrary project homepage (currently under maintenance):
– https://glibrary.ct.infn.it/
• Papers:
– A. Calanducci, C. Cherubino, L. N. Ciuffo, D. Scardaci, “A Digital Library Management System for the Grid”, Fourth International Workshop on Emerging Technologies for Next-generation GRID (ETNGRID 2007) at 16th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE-2007), GET/INT Paris, France, June 18-20, 2007 (http://etngrid.diit.unict.it/2007/index.html).
– A. Calanducci, C. Cherubino, L. N. Ciuffo, D. Scardaci, “gLibrary: Digital Asset Management System for the Grid”, IEEE Hypermedia and Grid Systems Conference at 30th Jubilee International Convention MIPRO, Opatija, Croatia, May 21-25, 2007 (http://www.mipro.hr/).
28
Thanks for your attention
https://glibrary.ct.infn.it/deroberto/