Slide 1
Experiences on Migration of Data in Digitization Projects
Julián Bescós
Presentation for the ERPANET Workshop
Workflow in Digital Preservation
Budapest, 13-15 October 2004
Slide 2
1. The Migration Issue
2. Our Experience
3. Migration Tasks
4. Best Practices for Preservation
5. Planning and Schedule
OVERVIEW
Slide 3
• Migration is the set of tasks needed to periodically transfer digital materials from one hardware/software configuration to another
Purpose
• Long-term preservation of the digital information created and stored using digital technology
• Allow broad access
– Retrieve, display and use
Origin
• New devices, processes and software replace the methods used to record, store and access
• New standards
• Enhancement of service
MIGRATION
Slide 4
• Technology obsolescence
– Hardware
    More powerful computers and higher density storage
    Elements for updating are not available (increase of storage, memory, etc.)
– Basic software
    Operating systems
    Database managers
• Media
– Lifetime is rarely the constraining factor for digital preservation (DP)
– Obsolescence of old storage media as newer and better media become available in the market
• Obsolescence of the access software
– Access in new platform and media
– Programs not available in the long term
– Changes in metadata and in image formats
– New functions of the software
ORIGIN OF MIGRATION
Slide 5
• In practice it is a combination of:
– Technology obsolescence
– New functionalities of the software
– Derived from information and communication technology
– Daily work on digitisation, storage and access requiring:
    Higher density storage
    Faster computers
• It is a consequence of:
– The digital world of information and communication technology is still relatively young and immature
ORIGIN OF MIGRATION
Slide 6
• Beginning in 1988 with the design and development of the Information System for the Archivo de Indias in Seville
• Computerization of 66 Archives and Libraries of different kinds and sizes in Spain and abroad
• Digitization of more than 20 million pages of ancient documents
• Installation of more than 320 workstations
• Development of our own products, ArchiDOC-ArchiGES, for Archives
• With a team in the areas of consulting, management, development, installation, training and maintenance of systems for archives
EXPERIENCE IN DIGITALIZATION PROJECTS
Archivo General de Indias, Sevilla Access Room in 1992
Slide 7
Archivo General de Indias, Sevilla
Archivo General de Simancas
Archivo Histórico Nacional, Madrid
Archivo Histórico Nacional - Sección Nobleza, Toledo
Archivo Histórico Nacional Sección Guerra Civil, Salamanca
Archivo de la Corona de Aragón, Barcelona
Archivo General de Navarra
Archivo del Reino de Valencia
Archivo del Reino de Mallorca
Biblioteca Sancho el Sabio, Vitoria
Archivo Virtual de la Corona de Aragón (with images from ACA and AHN)
Archivo Eclesiástico de Poblet
Archivo Histórico Universidad de Salamanca
Archivo Histórico de la Universidad de Santiago de Compostela
Archivo Histórico de la Universidad de Oviedo
Archivo General de la Nación, Colombia
Archivo Histórico Ultramarino, Lisboa
Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya
Biblioteca Valenciana
Archivo del Ilustre Colegio Notarial de Granada
Real Academia Española (Historical Dictionaries)
Diccionario Biográfico Real Academia Historia
Archivo General Militar, Segovia
Archivo General Militar, Ávila
Instituto de Historia y Cultura Militar
Archivo General de la Marina, El Viso del Marqués, Ciudad Real
Archivo Histórico Provincial de Murcia
Information System for the Archive, Library, Photo Library and Video Library of Cruz Roja Española
Biblioteca de la Fundación Francisco de Zabalburu, Madrid
Biblioteca Parlamento Vasco
Archivo-Biblioteca de la Diputación de Cáceres
Digitization of 11 newspapers for 11 Basque institutions (retrospective and current press)
Archivo Municipal de Castellón de la Plana
Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife
Archivo del Ayuntamiento Oviedo
Archivo del Komintern, Moscow and its replica in 6 National Archives, LOC and Open Society Archives
MAIN PROJECTS WITH DIGITALIZATION
Archivo General de Navarra
Archivo General Militar, Segovia
Zabalburu Library
Slide 8
Date | Institution | Number of Images | Kind of Images
89-02 | Archivo General de Indias, Sevilla | 11.000.000 | Manuscripts XVI-XIX
97- | Archivo General de la Nación, Colombia | 1.000.000 | Manuscripts
94-00 | Archivo General de Simancas | 1.000.000 | Manuscripts
97-04 | Archivo General Militar, Ávila | 180.000 | Military files
97-04 | Archivo General Militar, Segovia | 300.000 | Military files
98-04 | Archivo General de Navarra | 450.000 | Medieval manuscripts
98- | Archivo General de la Marina, El Viso del Marqués, Ciudad Real | 150.000 | Manuscripts
96- | Archivo de la Real Chancillería de Valladolid | | Manuscripts
93-03 | Archivo Histórico Nacional, Madrid | 3.000.000 | Manuscripts
95-01 | Archivo Histórico Nacional - Sección Nobleza, Toledo | 300.000 | Manuscripts
96 | Archivo Histórico Nacional Sección Guerra Civil, Salamanca | | Manuscripts
96 | Archivo Histórico Provincial, Vizcaya | |
97 | Archivo Histórico Provincial de Murcia | 250.000 | Protocols
99-02 | Archivo Histórico Provincial de Oviedo | |
95 | Archivo Histórico Ultramarino, Lisboa | | Ancient manuscripts
95-04 | Archivo de la Corona de Aragón, Barcelona | 200.000 | Medieval manuscripts
94-01 | Archivo Histórico de la Universidad de Salamanca | 700.000 | Manuscripts
96-02 | Archivo Histórico de la Universidad de Oviedo | |
97-04 | Archivo Histórico de la Universidad de Santiago de Compostela | 400.000 | Manuscripts
98-02 | Archivo del Komintern, Moscow | 1.000.000 | Documents 1900-1945
93-04 | Biblioteca y Archivo de la Fundación Sancho el Sabio, Vitoria | 1.100.000 | Monographs XVI-XIX
96-02 | Biblioteca de la Fundación Francisco de Zabalburu, Madrid | 700.000 | Manuscripts and monographs
96-00 | Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya | 100.000 |
97-01 | Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife | 100.000 | Manuscripts
96 | Archivo del Ilustre Colegio Notarial de Granada | 200.000 | Protocols
1998 | Instituto de Historia y Cultura Militar | 100.000 | Manuscripts
95-00 | Archivo Eclesiástico de Poblet | 200.000 | Manuscripts
98- | Archivo-Biblioteca de la Diputación de Cáceres | 200.000 | Minutes
98 | Archivo Municipal de Castellón de la Plana | |
98 | Centro de Investigaciones Biológicas (CSIC) | |
FIGURES OF DIGITALIZATION
Slide 9
Date | Institution | Number of Images | Kind of Images
96-00 | Real Academia Española | | Historical dictionaries
96-00 | Digitization of 11 newspapers for 11 Basque institutions | 300.000/year | Ancient journals
99-04 | Archivo Histórico Provincial Cantabria | |
2000 | Archivo Ayuntamiento Estella | |
00-02 | Archivo y Biblioteca Cruz Roja | | Photographs, monographs
00-04 | Archivo Virtual de Aragón (images from ACA and AHN) | | Medieval manuscripts
00-01 | Proyecto AER (initially with AGI and AHN) | |
00-04 | Biblioteca Parlamento Vasco | 300.000 | Monographs
01-04 | Archivo del Reino de Valencia | | Manuscripts, protocols
01-02 | Diccionario Biográfico Real Academia Historia | |
01 | Archivo del Ayuntamiento Oviedo | | Census registers (padrones), XV century
01-04 | Archivo del Reino de Mallorca | |
02 | Sistema Archivos Principado Asturias | |
02 | Archivo Casa de Alba | |
FIGURES OF DIGITALIZATION
Slide 10
1. Projects from 1988 – 1992: Computer System for Archivo General de Indias
• The Archive contains 86 million pages of original manuscripts related to the Spanish Administration in America (XV-XIX centuries), in 43.000 bundles
• The Computer System integrated:
– A Textual Data Base with 400.000 descriptive entries
– A Digital Image Archive with 11 million digital images in 1995
– A Module for User and Document Management: control of user management, consultation room, document movements and statistics
• Access by researchers and archivists from 50 workstations
• About 30% of present consultations are on screen (1 million pages/year)
• About 35% of printouts are digital (85.000/year)
• Access system in service since 1992
EXPERIENCES ON MIGRATION
Slide 11
Architecture
• The Data Base for Descriptions in SQL/400 keeps the hierarchical structure of fonds
• Standalone Digitization Workstations with flatbed scanners and optical disk drives under DOS
• Image servers based on PCs with optical disk drives
• Access from PCs under OS/2
Image Acquisition and Storage
• 11 million images digitized in gray levels with high fidelity to the original manuscripts
• Low cost workstations
• Legibility enhancements applied by users at consultation time
• Non-expert digitization operators
• Digitization: 100 dpi, 16 gray levels
• 1 page/minute, 15 workstations, 2 shifts, 4 years
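As a rough check on these production figures, the sketch below works out how many scanning hours each workstation needs per shift and per year to reach 11 million images. It uses only numbers from the slides (1 page per minute, 15 workstations, two daily shifts, 4 years). The function and variable names are illustrative.

```python
# Figures taken from the slides.
TOTAL_IMAGES = 11_000_000
PAGES_PER_MINUTE = 1
WORKSTATIONS = 15
SHIFTS = 2
YEARS = 4


def hours_per_shift_per_year(total_images, pages_per_minute, workstations, shifts, years):
    """Scanning hours each workstation needs, per shift and per year."""
    pages_per_station_hour = pages_per_minute * 60
    # Hours every single workstation has to run over the whole project.
    station_hours = total_images / (pages_per_station_hour * workstations)
    return station_hours / (shifts * years)


hours = hours_per_shift_per_year(
    TOTAL_IMAGES, PAGES_PER_MINUTE, WORKSTATIONS, SHIFTS, YEARS
)
print(f"{hours:.0f} scanning hours per workstation, shift and year")
```

The result is about 1.500 hours per shift and year, so each workstation would have been scanning for most of every working shift over the four years.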
EXPERIENCES ON MIGRATION
Slide 12
Image Acquisition and Storage
• Images stored in WORM optical disks
– The structure at the low level (bundle/documents) was also in directories in the WORM disks
– Access to images in one disk done through the call number of the document
– Images path as metadata: image names carried information about the document call number and page number
– No standard compression was available for gray level images. Images were DPCM compressed by software without losses.
• Compressed image size of an A4 page: 300-350 Kbytes
• Storage for 1 bundle: 2000 images x 350 KB = 700 MB
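The slides do not describe the DPCM codec itself, so the following is only a minimal sketch of the general idea behind lossless DPCM. Each pixel is replaced by its difference from the previous pixel, and decoding is a running sum that restores the original values exactly. The simple previous-pixel predictor and the sample row are assumptions. A real codec would also entropy-code the residuals, which is where the size reduction comes from.

```python
def dpcm_encode(row):
    """Replace each pixel by its difference from the previous pixel."""
    residuals = []
    previous = 0
    for pixel in row:
        residuals.append(pixel - previous)
        previous = pixel
    return residuals


def dpcm_decode(residuals):
    """Rebuild the pixel row by accumulating the differences."""
    row = []
    previous = 0
    for delta in residuals:
        previous += delta
        row.append(previous)
    return row


# One row of a 16-gray-level image (values 0..15), made up for illustration.
row = [7, 7, 8, 8, 8, 3, 3, 2]
residuals = dpcm_encode(row)
restored = dpcm_decode(residuals)
print(residuals)
print(restored == row)
```

Neighbouring pixels in a scanned manuscript tend to be similar, so most residuals are small values that can be stored in fewer bits.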
EXPERIENCES ON MIGRATION
Slide 13
Image Acquisition and Storage
• Media for storage of digital images:
Bundles | Media | Year beg. | Number of disks | Images
1.729 | IBM optical disks (200 MB) | 1989 | 6.916 | 3.458.000
3.732 | Plasmon optical disks (940 MB) | 1991 | 3.732 | 7.464.000
50 | CD-R (640 MB) | 1996 | | 100.000
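The disk counts in this table follow from the bundle size given earlier (2000 images of about 350 KB, that is 700 MB per bundle). The short sketch below reproduces that arithmetic. The function name is illustrative, and decimal megabytes are used, as on the slides.

```python
import math

IMAGES_PER_BUNDLE = 2000
IMAGE_SIZE_KB = 350


def disks_per_bundle(disk_capacity_mb):
    """Whole disks needed to hold one bundle of compressed images."""
    bundle_mb = IMAGES_PER_BUNDLE * IMAGE_SIZE_KB / 1000  # 700 MB
    return math.ceil(bundle_mb / disk_capacity_mb)


ibm_disks = 1729 * disks_per_bundle(200)      # IBM optical disks, 200 MB
plasmon_disks = 3732 * disks_per_bundle(940)  # Plasmon optical disks, 940 MB
ibm_images = 1729 * IMAGES_PER_BUNDLE
plasmon_images = 3732 * IMAGES_PER_BUNDLE

print(ibm_disks, plasmon_disks, ibm_images, plasmon_images)
```

A bundle needs four 200 MB IBM disks but fits on a single 940 MB Plasmon disk, which matches the 6.916 and 3.732 disks listed.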
EXPERIENCES ON MIGRATION
Slide 14
Slide 15
Slide 16
Slide 17
Example of blotches removal to be applied by the user
Slide 18
Slide 19
Example of reduction of ink bleeding through the paper
Slide 20
Archivo General de Indias
Digitization Room of Archivo de Indias in 1989
Slide 21
Archivo General de Indias
Shelf with optical disks
Slide 22
2. Projects from 1992 – 1996:
– Data Base Server under OS/2 and DB2
– Access and Digitization workstations from PCs with OS/2
– The relational Data Base keeps the hierarchical structure of documentation
– Images stored in CD-Rs
    Directory structures and image names changed.
    Metadata in binary control files: each image has information about call number (signature), position in hierarchical structure, page number, notes
    Image compression: JPEG
    Metadata in images: resolution, date, dimensions
EXPERIENCES ON MIGRATION
Slide 23
Example: metadata in Binary Control File
– The file keeps information about the hierarchical structure
– It maintains the relationship between each image file and its position in the document.
– The control file and its metadata can be imported into the database
EXPERIENCES ON MIGRATION
Slide 24
Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs
– The images of a bundle are stored in 1 or 2 CD-Rs
– Reading of optical disks through the network
– No direct connectivity between optical disks and Windows NT
– Main Operation Tasks:
    Decompression of the DPCM format
    Compression to JPEG format
    Temporary storage in magnetic disk
    All images of the bundle are copied to CD-R
    Verification of images by reading
    6.000 CD-Rs, and 6.000 CD-Rs as backup copy
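The copy and verification steps of such a bundle-by-bundle migration could look roughly like the sketch below. It is not the software used in the project. Comparing SHA-256 checksums is an assumption about how "verification by reading" might be done today, and the directory and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Read a file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_bundle_and_verify(staging_dir, target_dir):
    """Copy every file of one bundle; return names that fail verification."""
    staging_dir, target_dir = Path(staging_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in sorted(staging_dir.iterdir()):
        if not source.is_file():
            continue
        copy = target_dir / source.name
        shutil.copyfile(source, copy)
        # Re-read both files and compare their checksums.
        if sha256_of(source) != sha256_of(copy):
            failed.append(source.name)
    return failed


# Small demonstration with a temporary staging area (hypothetical names).
with tempfile.TemporaryDirectory() as work:
    staging = Path(work) / "staging"
    staging.mkdir()
    (staging / "0001.jpg").write_bytes(b"example image bytes")
    failures = copy_bundle_and_verify(staging, Path(work) / "cdr_image")
    copied = (Path(work) / "cdr_image" / "0001.jpg").read_bytes()

print(failures)
```

Any file name returned in the list would have to be copied again or recovered from another medium.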
EXPERIENCES ON MIGRATION
Slide 25
EXPERIENCES ON MIGRATION
Migration of Images from 6.916 WORM IBM disks to CD-Rs
– Typically 4 WORM disks (200 MB each) in 1 or 2 CD-Rs
Diagram: IBM Disks to CD-R
– IBM Optical Drives: Microchannel IBM PS/2, file system driver for OS/2, OS/2 1.3 and LAN Server, Token Ring Microchannel card
– Token Ring Network
– CD-R Drives: Pentium PC, Windows NT, Token-Ring PCI card, 3 GB disk, SCSI interface
Slide 26
Migration of Images from 3.732 WORM Plasmon disks to CD-Rs
– 1 WORM Plasmon disk (940 MB) in 1 or 2 CD-Rs
EXPERIENCES ON MIGRATION
Diagram: Plasmon Disks to CD-R
– Plasmon Drives: PC with i486, SCSI interface, file system driver for OS/2, OS/2 3.0, Ethernet card
– Ethernet network with hub
– CD-R Drives: Pentium PC, Windows NT, Token-Ring PCI card, 3 GB disk, SCSI interface
Slide 27
Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs
– Requirements of personnel and time
    3 operators during 4 months
EXPERIENCES ON MIGRATION
Similar migration schemes with fewer images:
• Library Sancho el Sabio (Vitoria) 1.000.000 images
• University of Salamanca 700.000 images
• Archivo General Militar, Segovia 200.000 images
• Archivo del Monasterio Poblet 100.000 images
Slide 28
3.Projects from 1996 to now:
– Oracle Data Base
– Access and Digitization workstations with PCs with Windows NT, ... Windows XP
– Capturing images also using standard programs and their metadata
– Images stored in magnetic disks. CD-ROMs as backup
    Metadata in database: scanning operator, date of creation, call number (signature), path, size in bytes… Data about control of the information
    Metadata in image: resolution, dimensions… Data for presentation in computers and for printing
    Image quality: 200 – 300 dpi, 256 gray levels. Color images
    Standard formats: TIFF, CCITT Group IV, JPEG, PDF
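To make the "metadata in database" idea concrete, here is a minimal sketch of a table holding the control fields listed above. The project used Oracle, but the sketch uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere. The table name, column names, the page-number column and the sample values are all assumptions.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE image_metadata (
        call_number TEXT NOT NULL,     -- document call number (signature)
        page_number INTEGER NOT NULL,  -- assumed: position of the page
        path        TEXT NOT NULL,     -- where the image file is stored
        size_bytes  INTEGER,           -- file size in bytes
        operator    TEXT,              -- scanning operator
        created_on  TEXT,              -- date of creation
        PRIMARY KEY (call_number, page_number)
    )
    """
)

# Hypothetical sample record.
connection.execute(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?, ?)",
    ("EXAMPLE-1", 1, "bundle_0001/0001.jpg", 182344, "operator01", "1998-05-12"),
)

# Find the stored image for a given call number and page.
row = connection.execute(
    "SELECT path FROM image_metadata WHERE call_number = ? AND page_number = ?",
    ("EXAMPLE-1", 1),
).fetchone()
print(row[0])
```

With the path held in the database, a later media migration only needs to update one column and can leave the file names unchanged.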
EXPERIENCES ON MIGRATION
Slide 29
Example: metadata in database
EXPERIENCES ON MIGRATION
Management of Image Access
Modes of Image Display
Slide 30
Example: metadata XML File
– Same functionality as the binary control file
– Standard: virtually any program can import these metadata
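The transcript does not preserve the actual XML layout, so the sketch below uses hypothetical element and attribute names (`document`, `image`, `callNumber`, `page`, `file`). It only illustrates how a control file in XML can be written and read back with a standard parser, here Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def build_control_xml(call_number, pages):
    """Serialize a document's page list as an XML control file (text)."""
    document = ET.Element("document", {"callNumber": call_number})
    for number, file_name in pages:
        ET.SubElement(document, "image", {"page": str(number), "file": file_name})
    return ET.tostring(document, encoding="unicode")


def read_control_xml(xml_text):
    """Parse the control file back into a call number and page list."""
    document = ET.fromstring(xml_text)
    pages = [
        (int(image.get("page")), image.get("file"))
        for image in document.findall("image")
    ]
    return document.get("callNumber"), pages


xml_text = build_control_xml("EXAMPLE-1", [(1, "0001.jpg"), (2, "0002.jpg")])
call_number, pages = read_control_xml(xml_text)
print(xml_text)
```

Because the file is plain XML, the same records could be imported by a database loader or any other tool without knowledge of a proprietary binary layout.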
EXPERIENCES ON MIGRATION
Slide 31
Migration of Archivo de Indias from CD-R to magnetic disk in 2000
– Project for online access and Internet
    Just copy. Images are already JPEG compressed
    10 RAID cabinets of 350 GB each (8 disks x 50 GB)
    1 operator was required for 1 month to copy from a CD-ROM tower to magnetic disks
– Transfer rate from different media:
Media | Transfer rate | Image | Bundle
IBM optical disk | 60 KB/s | 6 seconds | 4 hours
Plasmon optical disk | 100 KB/s | 3 seconds | 1 hour
CD-R 16x | 2,5 MB/s | <1 second | 5 minutes
Magnetic disk | 80 MB/s | | 1 minute
Similar Migrations:
    Sancho el Sabio Library (Vitoria) 1 million images
    Zabalburu Library 700.000 images
    Military Archives 500.000 images
    Archivo General Navarra 600.000 images
    Komintern Archives (Moscow) 1 million images
    ...
EXPERIENCES ON MIGRATION
Komintern Archives, Moscow
Slide 32
Diagram labels: Image Server, RAID Cabinets 1 to 10, Data Base Server, Domain Controller Server, WEB Server, four UPS units
Archivo General de Indias
SERVERS AND IMAGE STORAGE
Slide 33
Diagram labels: Data Base Server, Domain Controller Servers, WEB Servers, Image Server, RAID Cabinets 1 and 2 (space reserved for RAID Cabinet 3), UPS units (one reserved), auto-replicated online remote disk subsystem for backup and service, local network
Archivo General de Indias
Slide 34
• Analysis of origin and destination data models
• Equivalence between the fields in the origin and destination models
– New versions include new metadata not available before
• Development of migration software
• Testing with a limited number of objects
• Display of information in a destination card
• Application of migration to all data
• Verification of results
• Correction of errors:
– Sometimes some images cannot be copied and must be recovered from alternative media or even digitised again
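The tasks above can be sketched as a field-equivalence table plus a verification pass. This is only an outline under stated assumptions. The field names in `FIELD_MAP` and `NEW_FIELDS` are hypothetical, and the records are plain dictionaries rather than rows of the real databases.

```python
# Equivalence between origin and destination fields (hypothetical names).
FIELD_MAP = {
    "signature": "call_number",
    "npage": "page_number",
    "file": "path",
}
# Metadata that only exists in the new model starts out empty.
NEW_FIELDS = {"operator": None, "created_on": None}


def migrate_record(origin):
    """Build a destination record from one origin record."""
    destination = dict(NEW_FIELDS)
    for old_name, new_name in FIELD_MAP.items():
        destination[new_name] = origin[old_name]
    return destination


def verify(origin_records, destination_records):
    """Return indexes of records whose mapped fields do not match."""
    errors = []
    pairs = zip(origin_records, destination_records)
    for index, (origin, destination) in enumerate(pairs):
        if any(destination[new] != origin[old] for old, new in FIELD_MAP.items()):
            errors.append(index)
    return errors


# Test with a limited number of objects before migrating everything.
sample = [{"signature": "EXAMPLE-1", "npage": 1, "file": "0001.jpg"}]
migrated = [migrate_record(record) for record in sample]
errors = verify(sample, migrated)
print(migrated[0], errors)
```

Running the same two functions first on a small sample and then on the full data set mirrors the testing, application and verification steps in the list.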
MIGRATION TASKS
Komintern Archives, Moscow
Slide 35
Komintern Archives, Moscow
MAIN COST FACTORS
• Preparation of the system for migration
– Hardware and Basic Software:
    Magnetic disk storage for images
    PCs with appropriate OS and DB manager
• Development of Software (1 programmer, 2-3 weeks of work)
– Software development for migration
– Testing of migration of data
• Operation (usually less than 1 week)
– Significant operation with removable media
Slide 36
• General principles
– Based on PCs and mainstream commercial equipment
– Key hardware provided by first class IT companies
– Database managers of widespread use
– Consultations with institutions undertaking projects
– Based on standard elements and formats, official or de facto, like TIFF, JPEG, XML, etc.
– Modular, allowing a progressive installation and easy update of elements
– Selection of software:
    Functionalities
    Number of installations
    Maintenance
    Provided by an IT company established in the sector
– Key factors:
    Server, operating system, database manager
    Backup policies
BEST PRACTICES FOR PRESERVATION
Slide 37
• Digitization
– Capture systems:
    Robust flatbed scanners (A3)
    Zenithal (overhead) scanners. Digital cameras with limitations.
– Use of standard compression formats: JPEG, CCITT Group IV
– Ensure that digital images will allow a broad range of future use
– Capture the highest quality image technically possible and economically feasible for large-scale production
– Capture the informational content / physical appearance
– Fast and easy correction of errors
• Criteria for holding selection
– Value
– Condition
– Use
– Acceptability of the digital object
– Access aids
BEST PRACTICES FOR PRESERVATION
Slide 38
• Storage
– Media of wide use and low cost:
    Magnetic disk for online image service (especially in high demand)
        Disks with redundancy
        Backup on high capacity tapes (10/20 GB)
        One or two units available as hot swap
        It allows migration without personnel operation
    In a distributed network they may need to be stored online in multiple locations
    CD-R or DVD as backup for offline access in case of system failure
– In general there is little experience in storing massive quantities of culturally valuable materials
• Backup and Recovery
– Use industry standard backup and recovery procedures:
    Periodic backup to magnetic tape
    A copy held on site for near term recovery
    A copy stored off-site for disaster recovery
BEST PRACTICES FOR PRESERVATION
Slide 39
Traditional approach of Computer Science
• Migration of media
– Refreshing digital information by copying it from medium to medium
– Conversion of files to another format to be interpreted by new programs; to a reduced number of standard formats
• Migration of technology platform
– Server and PCs
– Peripherals
– Capture devices and CD-R writers
– Operating system and database manager
• Migration of the digitising and access software
– Maintenance of software in new platform
– New software versions for digitising and access
APPLICATION OF MIGRATION
Slide 40
• Planning for migration is difficult due to:
– the limited experience
– the impossibility of predicting when media, software and hardware will become obsolete
• No single strategy applies to all formats of digital information
• It varies across application environments, for different formats of digital materials and for preserving different degrees of computation, display and retrieval
• It requires a unique new solution for each new format and process
• Automatic conversion is only partially possible
• In general there are no firm plans for migration, other than to stay up to date with current technologies by migrating the content
• Usually there is urgency involved in migration, due to the obsolescence of software and hardware
PLANNING
Slide 41
• Schedule
– New releases of software, databases, etc. can be expected every 2-3 years, with minor updates more often
– Migration from one storage medium to another every 4-5 years, if not online
– Migration to new hardware and software occurs less frequently but can be expected every 5-10 years
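These intervals can be turned into a simple planning aid. The sketch below computes the earliest and latest year in which the next migration would be expected. The function and constant names are illustrative. The sample input, 1996, is the starting year of CD-R storage in the media table.

```python
# Expected intervals in years, taken from the schedule above.
SOFTWARE_RELEASE_INTERVAL = (2, 3)
MEDIA_INTERVAL = (4, 5)
PLATFORM_INTERVAL = (5, 10)


def next_migration_window(last_year, interval_years):
    """Return the (earliest, latest) year expected for the next migration."""
    shortest, longest = interval_years
    return last_year + shortest, last_year + longest


media_window = next_migration_window(1996, MEDIA_INTERVAL)
platform_window = next_migration_window(1996, PLATFORM_INTERVAL)
print(media_window, platform_window)
```

For media first used in 1996 the window is 2000 to 2001, which agrees with the Archivo de Indias moving from CD-R to magnetic disk in 2000.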
SCHEDULE
Slide 42
• Best practices for Digital Preservation
– Mainstream commercial equipment
– Use of standard formats
– Storage in magnetic disk with redundancy
– Backup policies
– Maintenance
• Periodical Update Policy
– Hardware
– Media
– Basic software
– Application software
SUMMARY