ADLUG 2013 - A proposal for an RDF assembly line
DESCRIPTION
A proposal for an RDF assembly line able to convert bibliographic data into RDF using Java, Solr and Apache Camel
TRANSCRIPT
Copyright 2009-2010 @CULT. All rights reserved
Andrea Gazzarini, Software Architect
32nd ADLUG ANNUAL MEETING 2013
ARTIUM and the Fundación Sancho El Sabio – Vitoria-Gasteiz
16th – 18th October 2013
A proposal for an RDF assembly line
Frederick W. Taylor
Agenda
Goals
Intermediate storage Layer
Final storage layer
Chaining all together
Q&A
Goals : functional
000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020    $a 880921191X
082 1  $a 853.8
100 1  $a Collodi, Carlo.
245 13 $a Le avventure di Pinocchio / $c C. Collodi ; illustrazioni di A. Mussino
260    $a Firenze : $b Giunti, $c 1997.
440  0 $a Collana favolosa / [Giunti]
521    $a Letteratura per ragazzi
700 1  $a Mussino, Attilio.
Do you remember?
Goals : non functional
[Figure: the sample MARC record above, repeated ten times]
Intermediate storage Layer
We assume that the conversion process takes records in MARC format. This is very important because it allows us
– to be completely decoupled from whatever LMS is in use;
– to have a set of predefined, standard rules;
– to have a fine level of granularity on the input: the same process can be executed for one record or for a million records without any difference at all
Intermediate storage Layer
Using a MARC binary stream as the main input requires an intermediate storage for collecting MARC data.
That storage will be widely used during the conversion phase. As an example, think about the relationships between records (i.e. entities) expressed in 77X tags: while we are processing record X, we may need to analyze record Y in order to make a decision.
Using a file as the only data container would make that very hard, because a file can be traversed only sequentially and there is no query language. We would have to hard-code a lookup process.
Using a database as the main input source (instead of a MARC file or stream) would resolve the query language issue, but each LMS has a different database with a different schema.
OseeGenius with M(arc)QL
INDEX
[Figure: copies of the sample MARC record being indexed]
SEARCH
(245a=History AND 1XX=Potter) OR (260b=Addison Wesley)
MQL examples
(245a=History AND 1XX=Potter) OR (260b=Addison Wesley)
245a=* AND 245c=Potter AND 260b=G?unt?
001=(27892 27290 29900 246923)
001=(27892 OR 27290 OR 29900 OR 246923)
005=[1995 TO 199502231]
100a=Morgan AND 100e=(interviewer OR collector)
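To make the wildcard syntax concrete, here is a minimal Java sketch of how the value part of a single term such as 260b=G?unt? could be checked against a subfield value. It assumes that ? stands for exactly one character and * for any run of characters, as in Lucene-style wildcard queries; the slides do not spell out the MQL semantics. The class and method names are hypothetical and this is not OseeGenius code.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of OseeGenius): checks one MQL-style
// wildcard expression against a MARC subfield value.
final class MqlWildcardSketch {

    // Translates a wildcard expression into a regular expression.
    // Assumption: '?' = exactly one character, '*' = zero or more characters.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every other character so it is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // True when the whole subfield value matches the wildcard expression.
    static boolean matches(String wildcard, String subfieldValue) {
        return toPattern(wildcard).matcher(subfieldValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("G?unt?", "Giunti"));         // prints true
        System.out.println(matches("G?unt?", "Giunti Editore")); // prints false
    }
}
```

A real search engine would match against indexed tokens rather than the raw subfield string, so punctuation such as the trailing comma in "Giunti," would normally be handled at index time.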
MQL result (example)
Search metadata
Search result (first record)
Final storage layer
Even if the final result could be directed to a local file, a more appropriate destination would be a triple store, which gives us the following benefits:
1) a standard query language (SPARQL) to query the store;
2) a standard format for exchanging data (RDF);
3) a storage where you are free to change your data in real time without any kind of reindex operation.
RDF Stream
TRIPLE STORE
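The third benefit can be illustrated with a toy in-memory stand-in for a triple store. This is only a sketch in plain Java: a real deployment would use an actual triple store queried through SPARQL, and the class name and the null-as-wildcard convention are assumptions made for this example. The triples are the ones used later in the transformation slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy in-memory stand-in for a triple store (illustration only).
final class TinyTripleStoreSketch {

    // Each triple is a 3-element list: subject, predicate, object.
    private final Set<List<String>> triples = new LinkedHashSet<List<String>>();

    void add(String s, String p, String o) {
        triples.add(Arrays.asList(s, p, o));
    }

    void remove(String s, String p, String o) {
        triples.remove(Arrays.asList(s, p, o));
    }

    // Matching in the spirit of a SPARQL triple pattern: null acts as a variable.
    List<List<String>> match(String s, String p, String o) {
        List<List<String>> result = new ArrayList<List<String>>();
        for (List<String> t : triples) {
            if ((s == null || s.equals(t.get(0)))
                    && (p == null || p.equals(t.get(1)))
                    && (o == null || o.equals(t.get(2)))) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyTripleStoreSketch store = new TinyTripleStoreSketch();
        store.add("atcult:27283", "bibo:isbn", "880921191X");
        store.add("atcult:27283", "dc:creator", "atcult:029100");
        // Changes are visible to the next query, with no separate reindex step.
        System.out.println(store.match("atcult:27283", null, null).size()); // prints 2
        store.remove("atcult:27283", "bibo:isbn", "880921191X");
        System.out.println(store.match("atcult:27283", null, null).size()); // prints 1
    }
}
```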
Introducing Apache Camel
Apache Camel™ is a versatile open-source integration framework based on known Enterprise Integration Patterns.
Camel empowers you to define routing and mediation rules in a variety of domain-specific languages, including a Java-based Fluent API, Spring or Blueprint XML Configuration files, and a Scala DSL.
Introducing Apache Camel
In practice, Apache Camel lets us define this kind of thing:
Basically, the whole process is divided into atomic pieces (called “processors”), and each of them is responsible for a small part of the overall work.
Each processor can act as a splitter or an aggregator, can do some content manipulation, or in general can perform some task on the incoming message.
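The idea of a line made of small processors can be sketched in plain Java. The code below deliberately does not use the Camel API, so that it runs without any dependency; the Processor interface, the three sample processors and the run method are all hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a chain of processors; this is NOT the Camel API.
final class AssemblyLineSketch {

    // One step of the line: takes the incoming messages, returns the outgoing ones.
    // A splitter returns more messages than it received, an aggregator fewer.
    interface Processor {
        List<String> process(List<String> messages);
    }

    // Splitter: one outgoing message per line of each incoming message.
    static final Processor SPLITTER = messages -> {
        List<String> out = new ArrayList<>();
        for (String m : messages) {
            out.addAll(Arrays.asList(m.split("\n")));
        }
        return out;
    };

    // Content manipulation: strips leading and trailing whitespace.
    static final Processor TRIMMER = messages -> {
        List<String> out = new ArrayList<>();
        for (String m : messages) {
            out.add(m.trim());
        }
        return out;
    };

    // Aggregator: joins all incoming messages into a single one.
    static final Processor AGGREGATOR = messages ->
            Arrays.asList(String.join(" | ", messages));

    // Runs the messages through every processor, in order.
    static List<String> run(List<String> input, List<Processor> line) {
        List<String> current = input;
        for (Processor p : line) {
            current = p.process(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" 020 $a 880921191X \n 100 1 $a Collodi, Carlo. ");
        System.out.println(run(input, Arrays.asList(SPLITTER, TRIMMER)));
        System.out.println(run(input, Arrays.asList(SPLITTER, TRIMMER, AGGREGATOR)));
    }
}
```

Because every step has the same shape, steps can be added, removed or reordered without touching the others, which is the property the assembly line relies on.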
Indexing processor
INDEX
[Figure: copies of the sample MARC record flowing into the index]
Once that is done, the storage contains the MARC records and provides MQL capabilities.
Note that this is a Near Real Time (NRT) engine, so records can be added on the fly and become visible to searchers almost immediately.
Last but not least, although we tagged this storage as “intermediate”, that doesn't mean it is transient: data is persisted and can be reused for further indexing processes.
Split processor
[Figure: the sample MARC record split into fragments, one per tag]
000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020 $a 880921191X
082 1 $a 853.8
100 1 $a Collodi, Carlo.
245 13 $a Le avventure di Pinocchio / $c C. Collodi ; illustrazioni di A. Mussino
260 $a Firenze : $b Giunti, $c 1997.
440 0 $a Collana favolosa / [Giunti]
521 $a Letteratura per ragazzi
700 1 $a Mussino, Attilio
Each “piece” will have a record fragment and a special correlation ID
001 00122721727
100 1 $a Collodi, Carlo
[Figure: further fragments of the same record, each paired with the correlation ID]
000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
100 1 $a Collodi, Carlo.
260 $a Firenze : $b Giunti, $c 1997.
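A minimal sketch of such a splitter, in plain Java, could look like the following. It assumes a textual record with one field per line and a 001 control field that carries the record ID; the Fragment class and the split method are hypothetical names, and a real implementation would work on parsed MARC data rather than on text lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative splitter: one fragment per tag, each carrying the record ID (001).
final class SplitSketch {

    // A record fragment plus the correlation ID of the record it came from.
    static final class Fragment {
        final String correlationId;
        final String tag;
        final String content;

        Fragment(String correlationId, String tag, String content) {
            this.correlationId = correlationId;
            this.tag = tag;
            this.content = content;
        }

        @Override
        public String toString() {
            return "001 " + correlationId + " | " + tag + " " + content;
        }
    }

    // Assumption: each line looks like "TAG rest-of-field" and a 001 line is present.
    static List<Fragment> split(List<String> recordLines) {
        String id = null;
        for (String line : recordLines) {
            if (line.startsWith("001 ")) {
                id = line.substring(4).trim();
            }
        }
        if (id == null) {
            throw new IllegalArgumentException("record has no 001 field");
        }
        List<Fragment> fragments = new ArrayList<>();
        for (String line : recordLines) {
            if (line.length() < 4 || line.startsWith("001 ")) {
                continue; // skip blank lines; the 001 travels with every fragment instead
            }
            fragments.add(new Fragment(id, line.substring(0, 3), line.substring(3).trim()));
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> record = Arrays.asList(
                "001 00122721727",
                "100 1 $a Collodi, Carlo.",
                "260 $a Firenze : $b Giunti, $c 1997.");
        for (Fragment f : split(record)) {
            System.out.println(f);
        }
    }
}
```

Carrying the 001 with every fragment is what lets later processors work on a single tag and still know which record it belongs to.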
Transformation processor (1/2)
001 27283
020 1 $a 880921191X
A processor that operates on a fragment (i.e. a tag) of a MARC record to produce an RDF triple.
Obviously this is a family of processors, because different logic can be applied depending on the incoming tag. In the simplest case a tag can be directly translated into a triple.
<atcult:27283> <bibo:isbn> "880921191X"
Other times, from a single tag we could generate a different (dependent) entity and a reference to it within the main entity. If the dependent entity has already been created by another similar tag, then the processor will create just a reference.
001 27283
100 1 $a Collodi Carlo.
<atcult:029100> <foaf:name> "Collodi Carlo"
<atcult:27283> <dc:creator> <atcult:029100>
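Both cases can be sketched in a few lines of plain Java. The tag-to-predicate choices follow the triples shown on this slide; everything else is an assumption made for the example, in particular the class name, the in-memory map that remembers already created authors, and the way author URIs are minted (the slide does not say how identifiers such as atcult:029100 are assigned).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative tag-to-triple transformer; URI minting is a made-up scheme.
final class TransformSketch {

    // Authors already turned into entities (name -> URI), so that a later
    // fragment with the same name only produces a reference.
    private final Map<String, String> authorUris = new LinkedHashMap<>();
    private int nextAuthorId = 1;

    // Returns the triples for one fragment: record ID, tag and $a value.
    List<String> transform(String recordId, String tag, String value) {
        List<String> triples = new ArrayList<>();
        String record = "<atcult:" + recordId + ">";
        if (tag.equals("020")) {
            // Simplest case: the tag maps directly to one triple.
            triples.add(record + " <bibo:isbn> \"" + value + "\"");
        } else if (tag.equals("100")) {
            String author = authorUris.get(value);
            if (author == null) {
                // First time this name is seen: create the dependent entity.
                author = "<atcult:author-" + (nextAuthorId++) + ">";
                authorUris.put(value, author);
                triples.add(author + " <foaf:name> \"" + value + "\"");
            }
            // In both cases the main entity gets a reference to the author.
            triples.add(record + " <dc:creator> " + author);
        }
        return triples;
    }

    public static void main(String[] args) {
        TransformSketch t = new TransformSketch();
        System.out.println(t.transform("27283", "020", "880921191X"));
        System.out.println(t.transform("27283", "100", "Collodi Carlo"));
        // Same author on another record: only the reference is produced.
        System.out.println(t.transform("99999", "100", "Collodi Carlo"));
    }
}
```

Keying the author lookup on the literal name is the simplest possible choice; a production line would more likely rely on authority record identifiers.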
Transformation processor (2/2)
001 27283
773 1 $a The Hug
We could have a transformer that, in order to produce its results (one or more triples), needs to find information about another record. Think, for example, of the 77X relation tags.
How do we get that information? In the incoming message we have just a correlation ID (the 001) and a tag.
This is where the intermediate storage with MQL capabilities comes in.
<atcult:27283> <frbr:Work> <atcult:92827>
<atcult:27283> <bibo:DocumentPart> <atcult:92827>
245a=The Hug
MARC XML
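The lookup step can be sketched as follows. The intermediate storage is represented by a plain function from an MQL query string to the 001 of the first matching record; that function, the class name and the method signature are assumptions for illustration (the real storage returns MARC XML, which would then be parsed). The predicate is copied from the second triple on this slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative transformer for a relation tag that needs another record.
final class RelationTransformSketch {

    // 'search' stands in for the intermediate storage: it takes an MQL query
    // and returns the 001 of the first matching record, or null if none matches.
    static List<String> transform(String recordId, String relatedTitle,
                                  Function<String, String> search) {
        List<String> triples = new ArrayList<>();
        String relatedId = search.apply("245a=" + relatedTitle);
        if (relatedId != null) {
            triples.add("<atcult:" + recordId + "> <bibo:DocumentPart> <atcult:" + relatedId + ">");
        }
        // When the related record is unknown, no triple is produced.
        return triples;
    }

    public static void main(String[] args) {
        // A map plays the role of the indexed MARC storage in this example.
        Map<String, String> storage = new HashMap<>();
        storage.put("245a=The Hug", "92827");
        System.out.println(transform("27283", "The Hug", storage::get));
    }
}
```

Passing the storage in as a function keeps the transformer independent of how the query is actually executed, so the same logic can be tested without a running index.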
The big picture
[Figure: a LOAD BALANCER in front of four Solr nodes (SOLR1, SOLR2, SOLR3, SOLR4); the index is split horizontally into parts 1 to 4, which are spread across the nodes]
An example (1/5)
An example (2/5)
An example (3/5)
An example (4/5)
An example (5/5)
Thank You!