(big) bibliographic data @ scads project meeting - 2015-06-12

(Big) Bibliographic Data. UB Leipzig & SLUB Dresden. ScaDS project meeting, 12.6.2015. Leander Seige, Felix Lohmeier, Ralf Talkenberger

Upload: felix-lohmeier

Post on 28-Jul-2015


TRANSCRIPT

Page 1: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

(Big) Bibliographic Data

UB Leipzig & SLUB Dresden

ScaDS project meeting, 12.6.2015

Leander Seige, Felix Lohmeier, Ralf Talkenberger

Page 2: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

“The library of the 21st century is a data hub.”

quoted from an internal strategic paper of Leipzig University Library, 2015

Page 3: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

simple bibliographic metadata

<metadata> title, author, isbn, publisher, year, …

<resource> books, serials, newspapers, articles, ...
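
As a minimal sketch, the simple metadata named on this slide could be held in a flat record like the one below. The class name `BibRecord`, the field types, and the sample values are assumptions made for illustration; they are not taken from finc or d:swarm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BibRecord:
    """Flat bibliographic record (hypothetical structure, for illustration only)."""
    resource_type: str               # e.g. "book", "serial", "newspaper", "article"
    title: str
    author: Optional[str] = None
    isbn: Optional[str] = None
    publisher: Optional[str] = None
    year: Optional[int] = None


# Made-up sample values, not real catalogue data.
record = BibRecord(resource_type="book", title="An Example Title",
                   author="Doe, Jane", year=2015)
```

Fields that are unknown simply stay `None`, which mirrors how sparse such records often are in practice.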

Page 4: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

<resource> book

● printed books on the library’s shelves

● bought ebooks

● licensed ebooks

● pay-per-use ebooks

● free content

● ebooks to be bought by the library (patron driven acquisition = pda)

● even printed books to be bought by the library (pda too)

Page 5: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

<resource> journals

● printed journals on the library’s shelves

● many more licensed electronic journals

○ full text accessible via web interfaces

● do we have article metadata?

● yes: licensed journal articles: tens of millions per library

Page 6: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

<metadata> accessibility information

● where is a resource? (physical or on the net)

● who is allowed to access this content? (students? faculty? everyone?)

● is it available off-campus?

● did we buy it or is it just licensed?

● may the user copy or print it?

● is the library allowed to store the electronic file?

● may we grant access from wifi connections?

● ...or any combination of these...
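
The questions above can be read as fields of an access record plus a rule that combines them. The following is only a sketch under assumptions: `AccessInfo`, its field names, and `can_access` are hypothetical and do not describe how either library actually models licences.

```python
from dataclasses import dataclass, field


@dataclass
class AccessInfo:
    """Hypothetical accessibility metadata for one resource."""
    location: str                                      # shelf mark or URL
    allowed_groups: set = field(default_factory=set)   # empty set = everyone
    off_campus: bool = False                           # available off-campus?
    owned: bool = False                                # bought (True) or just licensed (False)
    may_copy_or_print: bool = False
    may_archive: bool = False                          # may the library store the file?
    wifi_allowed: bool = True


def can_access(info, user_group, on_campus, via_wifi):
    """Combine the individual restrictions into a single yes/no decision."""
    if info.allowed_groups and user_group not in info.allowed_groups:
        return False
    if not on_campus and not info.off_campus:
        return False
    if via_wifi and not info.wifi_allowed:
        return False
    return True


# Made-up example: a licensed ebook for students and faculty, on campus only.
licensed = AccessInfo(location="https://example.org/ebook",
                      allowed_groups={"students", "faculty"})
```

Because every restriction is an independent flag, "any combination of these" is covered without special cases.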

Page 7: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

<metadata> knowledge bases

● librarians built large knowledge bases to describe resources

● in German-speaking countries: GND (Gemeinsame Normdatei) of the German National Library (Deutsche Nationalbibliothek) http://www.dnb.de/EN/gnd

● international: http://viaf.org

● provide DBpedia links to explore the linked data cloud and to enrich library data

Page 8: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

<metadata> knowledge bases

● GND (and other national authority files via VIAF)

○ describe Persons, Corporate bodies, Conferences and Events, Geographic Information, Topics, Works and relationships between them

○ form a generic knowledge base, independent of any specific domain

○ provide links to other knowledge bases (DBpedia, GeoNames...)

Page 9: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

resource discovery

● traditional “OPACs” provided access to traditional library resources like printed books; users had to use proprietary vendor-driven portals to access electronic resources

● today, printed materials represent only a small part of library resources

● in contrast: resource discovery systems aim to integrate all

resources of a library and present them in one single search

interface

Page 10: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Cooperation

● UBL and SLUB joined forces in March 2015

● Goals:

a. Exchange of metadata after processing

b. Develop common workflows to avoid “double work”

→ integrate existing tools finc & d:swarm

Page 11: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

finc Community

● maintains a large search engine infrastructure

● developed and hosted at Leipzig University Library

● based on Apache Solr and VuFind

● rugged metadata management system,

processing millions of data records each day

● integrates more than 50 data sources

https://finc.info

Page 12: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

finc Community

● provides more than 15 university libraries with resource discovery systems

● offers great potential to design and implement user-oriented functions on real-world systems, serving thousands of library users in Saxony and beyond, every day

● employs the aggregated index at Leipzig University Library

https://finc.info

Page 13: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

10% physical items

90% electronic content

on the net

aggregated index at Leipzig University Library

Page 14: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

aggregated index at Leipzig University Library

● 12 million traditional data records (growing)

● 80 million electronic article data records (growing)

● each record contains 20 data fields

1.8 billion triples (if you triplify it)

(without any enrichment data)
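
The triple estimate follows directly from the record counts: one triple per field per record. The sketch below reproduces that arithmetic; the `triplify` helper and its naive (subject, predicate, object) mapping are assumptions for illustration, not the conversion used in production.

```python
def triplify(record_id, fields):
    """Turn one flat record into (subject, predicate, object) triples."""
    return [(record_id, key, value) for key, value in fields.items()]


traditional_records = 12_000_000   # figures taken from the slide
article_records = 80_000_000
fields_per_record = 20

# One triple per field per record, without any enrichment data.
estimated_triples = (traditional_records + article_records) * fields_per_record
print(estimated_triples)  # 1840000000, i.e. roughly 1.8 billion
```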

Page 15: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Data processing today

● distributed data storage

○ 2 Solr in Leipzig (~12 million + ~80 million records)

○ 2 Solr in Dresden (~2 million + ~2 million records)

● constraint: each data source is handled separately → difficult to build up relations and deep data integration

Page 16: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

d:swarm

● yet another tool…?

a. property graph database

b. GUI for library staff

Page 17: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Tools: finc and d:swarm

focus

○ finc: data normalization

○ d:swarm: data integration and enrichment

technology

○ finc: script-based transformations (Python, Go, Elasticsearch)

○ d:swarm: encapsulates Metafacture (open source toolchain for metadata transformation); property graph (Neo4j)

status

○ finc: works fine with ~100 million records (less than one day)

○ d:swarm: scalability issues (~4 million records in less than one day)

Page 18: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

integrating finc with d:swarm

● enhance data processing regarding

○ authority data linking (NLP)

○ fuzzy deduplication

○ classification

○ relate bibliographic data to places, topics, abstract terms

○ publish machine readable data (linked data)

● create user interfaces to enable system librarians to control metadata

processing
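
To make "fuzzy deduplication" concrete, here is a minimal sketch that compares normalized titles with Python's standard `difflib.SequenceMatcher`. The record layout, the year check, and the 0.9 threshold are assumptions; the slides do not say which matching method finc or d:swarm use.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(cleaned.split())


def is_probable_duplicate(a, b, threshold=0.9):
    """Treat two records as duplicates if years match and titles are very similar."""
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold


# Made-up sample records.
rec_a = {"title": "Introduction to Information Retrieval", "year": 2008}
rec_b = {"title": "Introduction to information retrieval.", "year": 2008}
rec_c = {"title": "Graph Databases", "year": 2008}
```

Comparing every pair of records does not scale to tens of millions of records, so a real pipeline would first group candidates (for example by year or by a title prefix) before running a comparison like this.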

Page 19: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Tomorrow: common workflows

● All data flows through both tools (finc + d:swarm)

● Deduplication (duplicates are easier to recognize in a graph DB)

● FRBRization (aggregate different physical and formal versions of a

work)

● Knowledge graph makes enrichment (authorities, altmetrics data,

usage data, …) and analytics easier
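
FRBRization, as described above, means grouping the different versions of a work under one entry. A very reduced sketch is shown below; the `work_key` built from lower-cased author and title is an assumption chosen for brevity, and real FRBRization needs far more robust matching.

```python
from collections import defaultdict


def work_key(record):
    """Hypothetical work identifier: lower-cased author plus title."""
    author = record.get("author", "").strip().lower()
    title = record.get("title", "").strip().lower()
    return (author, title)


def frbrize(records):
    """Group physical and formal versions under their shared work key."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    return dict(works)


# Made-up sample records: a print and an ebook version of one work, plus another work.
records = [
    {"author": "Doe, Jane", "title": "An Example Work", "format": "print", "year": 2010},
    {"author": "Doe, Jane", "title": "An example work", "format": "ebook", "year": 2012},
    {"author": "Roe, Richard", "title": "Another Work", "format": "print", "year": 2011},
]
works = frbrize(records)
```

In a graph database the same grouping would be expressed as edges from each version to a shared work node, which is what makes later enrichment and analytics easier to attach.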

Page 20: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Scalability issues

● current implementation of property graph is too slow

● test results with 64GB RAM, SSD, 16 cores

○ 1.2 million records (flat format): 10 hours for complete workflow (ingest, transformation, export)

○ more complex formats (MARC21): up to 5x as many statements

● single Neo4j instance, storage and memory issues
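
A rough back-of-the-envelope calculation shows why these numbers are a problem. It uses the test result above (1.2 million records in 10 hours) and the finc figure from the tools comparison (about 100 million records in less than one day, taken here as a full 24 hours). Linear scaling is an assumption and is probably optimistic given the storage and memory issues.

```python
def records_per_second(records, hours):
    """Average throughput for a batch run."""
    return records / (hours * 3600)


dswarm_rate = records_per_second(1_200_000, 10)    # property graph test run
finc_rate = records_per_second(100_000_000, 24)    # lower bound: "less than one day"

# Time to push the whole aggregated index (12 + 80 million records)
# through the property graph workflow, assuming linear scaling.
hours_for_full_index = 92_000_000 / dswarm_rate / 3600

print(round(dswarm_rate, 1))          # records per second in the test run
print(round(finc_rate, 1))            # records per second for finc (at least)
print(round(hours_for_full_index))    # hours for the full index
```

Under these assumptions the full index would take roughly a month on the test machine, compared with less than a day for the script-based pipeline.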

Page 21: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

d:swarm architecture

Page 22: (Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Possible solutions?

● “throw hardware at it” (German original: “mit Hardware erschlagen”)

● Another graphDB, parallelization?

○ ArangoDB: https://www.arangodb.com

○ Apache Giraph: http://giraph.apache.org

○ Blazegraph: http://blazegraph.com (Wikidata’s choice)

● Gradoop?!