BHL Technology Overview
DESCRIPTION
Presentation to Smithsonian's Office of the Chief Information Officer.
TRANSCRIPT
![Page 1: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/1.jpg)
© 2008 Biodiversity Heritage Library www.biodiversitylibrary.org
Biodiversity Heritage Library (BHL): Technology Overview
Chris Freeland
Director, Bioinformatics
Missouri Botanical Garden
Technical Director
Biodiversity Heritage Library
www.biodiversitylibrary.org
![Page 2: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/2.jpg)
BHL Partners
Museums
– American Museum of Natural History (New York)
– Natural History Museum (London)
– Smithsonian Institution (Washington)
– The Field Museum (Chicago)
Botanical Gardens
– Missouri Botanical Garden
– New York Botanical Garden
– Royal Botanic Gardens, Kew
University Libraries
– Botany Libraries, Harvard University
– Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
– University of Illinois
Bioinformatics Institutes
– MBL/WHOI
– uBio.org
![Page 3: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/3.jpg)
Why have BHL?
In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original.
Charles Davies Sherborn, Epilogue to Index Animalium, March 1922
Charles Davies Sherborn (1861-1942)
![Page 4: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/4.jpg)
Unique Components of BHL
• Combining metadata records from multiple libraries (similar, but different) and representing them through a shared portal
• Use of JPEG2000
• Web 2.0 Mashups
• Taxonomic data mining
• Services
• Rare & novel content
![Page 5: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/5.jpg)
Scanning process
1. Select Book
2. Pull from Shelf
3. Send to IA scanning center
4. Book is scanned & QA
5. Page images loaded on IA cluster
   1. Derivatives created
6. Book returned to library
7. Files harvested from IA portal
8. Books available for display within BHL portal
![Page 6: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/6.jpg)
Mushrooms of America, edible and poisonous. Ed. by Julius A. Palmer, Jr., 1885.
![Page 7: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/7.jpg)
Scan & Store: Internet Archive
Scanning on Scribes
Storage in Petaboxes
![Page 8: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/8.jpg)
Scanning & Derivatives
Master
• XML
• JP2
Derivatives
• PDF
• JPG
• TXT
• DjVu
![Page 9: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/9.jpg)
Harvest from IA
Extract, Transform, Load (ETL)
• Custom scripts to extract content via IA’s APIs
• Database scripts to transform to relational data structure
• Load into database
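The three ETL steps above can be sketched roughly as follows. This is a minimal illustration, not BHL's actual scripts: the record shape, table names, and column names are assumptions, the extract step is replaced by a hard-coded sample record, and SQLite stands in for whatever database the portal uses. Only the title, year, identifier, and `_NNNN.jp2` file-name pattern come from the slides.

```python
import sqlite3

# Hypothetical record for one scanned item, standing in for the output of an
# extract step; the field names and page list are illustrative assumptions.
SAMPLE_ITEM = {
    "identifier": "mushroomsofameri00palm",
    "title": "Mushrooms of America, edible and poisonous",
    "year": "1885",
    "pages": ["0001", "0002", "0003"],
}


def transform(item):
    """Flatten one item record into relational rows (one item, many pages)."""
    item_row = (item["identifier"], item["title"], int(item["year"]))
    page_rows = [
        (item["identifier"], seq, f'{item["identifier"]}_{suffix}.jp2')
        for seq, suffix in enumerate(item["pages"], start=1)
    ]
    return item_row, page_rows


def load(conn, item_row, page_rows):
    """Create the (assumed) tables if needed and write the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item "
        "(identifier TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(identifier TEXT, sequence INTEGER, file_name TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO item VALUES (?, ?, ?)", item_row)
    # Remove earlier page rows so that re-running the load does not duplicate them.
    conn.execute("DELETE FROM page WHERE identifier = ?", (item_row[0],))
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", page_rows)
    conn.commit()


def run_etl(conn, items):
    """Transform and load every item; return the total number of page rows."""
    for item in items:
        load(conn, *transform(item))
    return conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]
```

Calling `run_etl(sqlite3.connect(":memory:"), [SAMPLE_ITEM])` loads one item row and three page rows into an in-memory database.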
![Page 10: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/10.jpg)
![Page 11: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/11.jpg)
![Page 12: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/12.jpg)
![Page 13: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/13.jpg)
Stable URL
Attribution
Name Finding
Page Turning
Zoom/Pan
Download/View
Browse
Search
Filter
Target/Object
![Page 14: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/14.jpg)
JPEG2000 (*.jp2) display
• RAW original => 85% .jp2
• LuraTech encoder
  – Wavelet compression
• LizardTech decoder
  – Tiled on the fly, cached for performance
• GSIV browser-based client viewer
  – ‘AJAXian’
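The "tiled on the fly, cached for performance" idea can be sketched as below. This is not how the LizardTech decoder or GSIV works internally; it only illustrates cutting a page image into a grid of tiles on request and caching each result. The 256-pixel tile size, the function names, and the use of an in-process LRU cache are all assumptions, and the decode step is replaced by returning the tile's pixel box.

```python
from functools import lru_cache

TILE_SIZE = 256  # assumed tile edge in pixels; not stated in the slides


def tile_bounds(width, height, col, row, tile_size=TILE_SIZE):
    """Return the (left, top, right, bottom) pixel box for one tile."""
    left, top = col * tile_size, row * tile_size
    if col < 0 or row < 0 or left >= width or top >= height:
        raise ValueError("tile is outside the image")
    # Tiles on the right and bottom edges are clipped to the image size.
    return (left, top, min(left + tile_size, width), min(top + tile_size, height))


@lru_cache(maxsize=1024)
def get_tile(page_id, width, height, col, row):
    """Stand-in for decoding one region of a .jp2; repeat requests hit the cache."""
    return tile_bounds(width, height, col, row)
```

Requesting the same tile twice runs the stand-in decode step once; `get_tile.cache_info()` reports the hit and miss counts.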
![Page 15: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/15.jpg)
BHL/IA architecture
A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
[Diagram components: Browser (GSIV.js); www.biodiversitylibrary.org (/page/1274907); BHLdb (locate: pageid: 1274907); IA (http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2); images.mobot.org; LizardTech ExpressServer (.jp2, .jpg)]
Transfer: 5.0+ sec
Time to deliver image: 8+ sec
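The "locate" step in this flow, going from a stable page URL to the stored .jp2 location, might look something like the sketch below. The dictionary is a stand-in for the BHLdb lookup and the function names are hypothetical; the one entry is copied from the slide as shown, including its elided path segment.

```python
from urllib.parse import urlparse

# Stand-in for the BHLdb lookup table. The single entry is copied from the
# slide exactly as shown, including the elided "..." path segment.
PAGE_TO_JP2 = {
    1274907: "http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2",
}


def page_id_from_url(url):
    """Extract the numeric page id from a /page/<id> URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "page" or not parts[1].isdigit():
        raise ValueError(f"not a page URL: {url}")
    return int(parts[1])


def locate_jp2(url, table=PAGE_TO_JP2):
    """Look up the stored .jp2 location for a page URL (KeyError if unknown)."""
    return table[page_id_from_url(url)]
```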
![Page 16: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/16.jpg)
Reuse, don’t rebuild
![Page 17: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/17.jpg)
Name Finding in action
with Taxonomic Intelligence…
• TIF image from scanner
• Converted to text via PrimeOCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• SOAP response
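To give a feel for the "extract names" step, here is a deliberately naive stand-in. It is not TaxonFinder's algorithm and makes no call to NameBank: a regular expression picks out candidates shaped like Latin binomials in OCR text, and a small set of known genera (an assumption standing in for the NameBank check) filters them.

```python
import re

# Naive binomial shape: capitalized genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text, known_genera):
    """Return binomial-shaped strings whose genus is in known_genera, in order."""
    found = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        # The genus check stands in for verification against a names database.
        if genus in known_genera and name not in found:
            found.append(name)
    return found
```

Ordinary phrases such as "The book" also match the pattern, which is why some form of verification against an authority list is needed after the pattern match.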
![Page 18: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/18.jpg)
Names data mining
![Page 19: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/19.jpg)
Tag cloud from LCSH
Subject Heading from library catalog
Expressed as MARCXML
Tag Cloud
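One way a tag cloud could be derived from subject headings in MARCXML is sketched below. It assumes topical headings live in MARC field 650, subfield a, and uses a simple linear scale for font sizes; the sample records, the size range, and the function names are illustrative assumptions, not BHL's implementation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample records for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Mushrooms</subfield>
      <subfield code="z">United States</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Mushrooms</subfield></datafield>
    <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Botany</subfield></datafield>
  </record>
</collection>"""


def subject_counts(marcxml):
    """Count topical subject headings (field 650, subfield a)."""
    root = ET.fromstring(marcxml)
    counts = Counter()
    for field in root.iterfind(".//marc:datafield[@tag='650']", MARC_NS):
        for sub in field.iterfind("marc:subfield[@code='a']", MARC_NS):
            term = (sub.text or "").strip().rstrip(".")
            if term:
                counts[term] += 1
    return counts


def font_sizes(counts, smallest=10, largest=30):
    """Scale counts linearly onto a font-size range for display."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = max(hi - lo, 1)
    return {
        term: smallest + (largest - smallest) * (n - lo) // span
        for term, n in counts.items()
    }
```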
![Page 20: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/20.jpg)
Geocoding LCSH
![Page 21: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/21.jpg)
RSS Feeds
Specific: Last 25 books published in German from NYBG
RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25/GER/NYBG
1. Allgemeine deutsche Garten-Zeitung, 7, 1829 (added: 04/03/2008)
2. Zeitschrift für wissenschaftliche Mikroskopie und für mikroskopische Technik. 2, 1885 (added: 03/28/2008)
3. Zeitschrift für technische Biologie. 7, 1919 (added: 03/27/2008)
4. …
General: Last 25 books from all libraries
RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25
1. Summa plantarum : v.1 (added: 05/01/2008)
2. Vegetable materia medica of the United States (added: 04/30/2008)
3. The family herbal; (added: 04/30/2008)
4. …
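The two feed locations suggest a URL pattern of count, then an optional language code, then an optional institution code. The helper below builds URLs on that reading; the pattern is inferred from the two examples only, the function and parameter names are hypothetical, and whether the service accepts an institution without a language is unknown, so that case is rejected here.

```python
FEED_BASE = "http://www.biodiversitylibrary.org/RecentRss"


def recent_feed_url(count=25, language=None, institution=None):
    """Build a RecentRss URL following the pattern seen in the two examples."""
    if count < 1:
        raise ValueError("count must be positive")
    if institution and not language:
        # No example shows this combination, so it is not guessed at.
        raise ValueError("institution without language is not covered by the examples")
    parts = [FEED_BASE, str(count)]
    if language:
        parts.append(language)
    if institution:
        parts.append(institution)
    return "/".join(parts)
```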
![Page 22: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/22.jpg)
Services
• Names
  – v.1 released: http://www.biodiversitylibrary.org/services/name/NameService.asmx
• Stable URLs
  – http://www.biodiversitylibrary.org/bibliography/1652
  – http://www.biodiversitylibrary.org/name/Carcharodon_carcharias
• Future:
  – Citation Resolver
  – Titles Resolver
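The stable URL examples above can be generated with two small helpers. The rule that spaces in a scientific name become underscores is inferred from the single `Carcharodon_carcharias` example, and the function names are hypothetical.

```python
PORTAL = "http://www.biodiversitylibrary.org"


def bibliography_url(title_id):
    """Stable URL for a bibliographic record, given its numeric id."""
    return f"{PORTAL}/bibliography/{int(title_id)}"


def name_url(scientific_name):
    """Stable URL for a scientific name; whitespace runs become underscores (assumed rule)."""
    return f"{PORTAL}/name/{'_'.join(scientific_name.split())}"
```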
![Page 23: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/23.jpg)
BHL Name Services
http://www.biodiversitylibrary.org/services/name/NameService.asmx
![Page 24: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/24.jpg)
Provider Integration
• Encyclopedia of Life
• Atrium Andes Biodiversity
• Wikipedia
• EDIT Scratchpads
• More to come…
![Page 25: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/25.jpg)
![Page 26: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/26.jpg)
![Page 27: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/27.jpg)
Hardware Infrastructure
• Distributed
• Partially redundant
  – Work needed
• Mixed platforms
• Mixed app frameworks
![Page 28: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/28.jpg)
MOBOT
Petabox cluster
Internet Archive
![Page 29: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/29.jpg)
![Page 30: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/30.jpg)
File Storage Estimates
• 4MB per page including derivatives
• 1 million pages = 4TB storage
• Expected output: 60 - 100 million pages
  – 240 - 400 TB for files
  – 10 - 20 GB for db
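The file storage figures follow directly from the per-page estimate. A quick check, assuming decimal units (1 TB = 1,000,000 MB) as the slide's own "1 million pages = 4TB" implies:

```python
MB_PER_PAGE = 4        # from the slide: 4MB per page including derivatives
MB_PER_TB = 1_000_000  # decimal units, matching "1 million pages = 4TB"


def storage_tb(pages, mb_per_page=MB_PER_PAGE):
    """File storage in TB for a given number of pages."""
    return pages * mb_per_page / MB_PER_TB


low, high = storage_tb(60_000_000), storage_tb(100_000_000)
```

Here `low` and `high` come out at 240 and 400 TB, matching the range on the slide.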
![Page 31: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/31.jpg)
Future Work
• Services
  – Citation Resolver
  – Titles Resolver
• Interfaces
• Editing
  – Authoritative
  – Community
• Backend
![Page 32: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/32.jpg)
Fedora
• Funded by Gordon and Betty Moore Foundation to adopt Fedora Commons
• Working with Internet Archive to define use and practice
• Project completion: December 2009
![Page 33: BHL Technology Overview](https://reader035.vdocument.in/reader035/viewer/2022062319/5579ae4cd8b42ac1148b4fad/html5/thumbnails/33.jpg)
Thank You
Chris Freeland
BHL Portal
www.biodiversitylibrary.org
BHL Blog
biodiversitylibrary.blogspot.com
BHL collection at Internet Archive
www.archive.org/details/biodiversity