ADLUG 2013 - A proposal for an RDF assembly line

Andrea Gazzarini, Software Architect. 32nd ADLUG Annual Meeting 2013, ARTIUM and the Fundación Sancho El Sabio, Vitoria-Gasteiz, 16th – 18th October 2013. A proposal for an RDF assembly line. Frederick W. Taylor. Copyright 2009-2010 @CULT. All rights reserved.

Upload: andrea-gazzarini

Post on 24-Apr-2015

DESCRIPTION

A proposal for an RDF assembly line that converts bibliographic data to RDF using Java, SOLR and Apache Camel.

TRANSCRIPT

Page 1: ADLUG 2013 - A proposal for an RDF assembly line

Copyright 2009-2010 @CULT. All rights reserved

Andrea Gazzarini, Software Architect

32nd ADLUG ANNUAL MEETING 2013

ARTIUM and the Fundación Sancho El Sabio – Vitoria-Gasteiz

16th – 18th October 2013

A proposal for an RDF assembly line

Frederick W. Taylor

Page 2: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Intermediate storage Layer

Final storage layer

Chaining all together

Q&A

Page 3: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Intermediate storage Layer

Final storage layer

Chaining all together

Q&A

Page 4: ADLUG 2013 - A proposal for an RDF assembly line

Goals : functional

000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020 $a 880921191X
082 1 $a 853.8
100 1 $a Collodi, Carlo.
245 13 $a Le avventure di Pinocchio / $c C. Collodi ; illustrazioni di A. Mussino
260 $a Firenze : $b Giunti, $c 1997.
440 0 $a Collana favolosa / [Giunti]
521 $a Letteratura per ragazzi
700 1 $a Mussino, Attilio.

Page 5: ADLUG 2013 - A proposal for an RDF assembly line

Do you remember?

Page 6: ADLUG 2013 - A proposal for an RDF assembly line

Goals : non functional

[Figure: the same MARC record ("Le avventure di Pinocchio") repeated ten times]

Page 7: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Intermediate storage Layer

Final storage Layer

Chaining all together

Q&A

Page 8: ADLUG 2013 - A proposal for an RDF assembly line

Intermediate storage Layer

We assume that the conversion process takes records in MARC format. This is very important because it allows us

– to be completely decoupled from any particular LMS;
– to have a set of predefined, standard rules;
– to have a fine level of granularity on the input: the same process can be executed for one record or for a million records without any difference at all.

[Figure: the sample MARC record for "Le avventure di Pinocchio"]

Page 9: ADLUG 2013 - A proposal for an RDF assembly line

Intermediate storage Layer

Using a binary MARC stream as the main input requires an intermediate storage for collecting the MARC data.

That storage will be used heavily during the conversion phase. As an example, think about the relationships between records (i.e. entities) expressed in the 77X tags:

while we are processing record X we may need to inspect record Y in order to make a decision.

With a file as the only data container this would be very hard, because a file can only be traversed sequentially and there is no query language. We would have to hard-code a lookup process.

Using a database as the main input source (instead of a MARC file or stream) would solve the query language issue, but each LMS has a different database with a different schema.

Page 10: ADLUG 2013 - A proposal for an RDF assembly line

OseeGenius with M(arc)QL

[Figure: MARC records are loaded into the INDEX; the SEARCH side takes an MQL query such as the following]

(245a=History AND 1XX=Potter) OR (260b=Addison Wesley)

Page 11: ADLUG 2013 - A proposal for an RDF assembly line

MQL examples

(245a=History AND 1XX=Potter) OR (260b=Addison Wesley)

245a=* AND 245c=Potter AND 260b=G?unt?

001=(27892 27290 29900 246923)

001=(27892 OR 27290 OR 29900 OR 246923)

005=[1995 TO 199502231]

100a=Morgan AND 100e=(interviewer OR collector)
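To make the wildcard example above more concrete, here is a minimal Java sketch that evaluates a single MQL term such as 260b=G?unt? against one record. It is not the OseeGenius implementation: the class name MqlTerm, the Map-based record model, word-level matching, and the reading of ? as exactly one character and * as any run of characters are all assumptions made for illustration. Boolean operators, ranges and value lists are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: evaluates one "field=pattern" term against a record. */
final class MqlTerm {
    private final String field;    // e.g. "260b" (tag plus subfield code)
    private final Pattern pattern; // wildcard pattern compiled to a regex

    private MqlTerm(String field, Pattern pattern) {
        this.field = field;
        this.pattern = pattern;
    }

    /** Parses a single term such as "260b=G?unt?". */
    static MqlTerm parse(String term) {
        int eq = term.indexOf('=');
        if (eq < 1) {
            throw new IllegalArgumentException("Expected field=value: " + term);
        }
        String field = term.substring(0, eq).trim();
        String value = term.substring(eq + 1).trim();
        StringBuilder regex = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '?') {
                regex.append('.');   // assumed: exactly one character
            } else if (c == '*') {
                regex.append(".*");  // assumed: any run of characters
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return new MqlTerm(field,
                Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE));
    }

    /** True if any word stored under the field matches the pattern. */
    boolean matches(Map<String, List<String>> record) {
        for (String value : record.getOrDefault(field, List.of())) {
            // Split on punctuation and whitespace so "Giunti," yields "Giunti".
            for (String token : value.split("[^\\p{L}\\p{N}]+")) {
                if (pattern.matcher(token).matches()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> record = Map.of(
                "245a", List.of("Le avventure di Pinocchio /"),
                "260b", List.of("Giunti,"));
        System.out.println(MqlTerm.parse("260b=G?unt?").matches(record));
    }
}
```

In a SOLR-backed storage the matching would be delegated to the search engine; the sketch only shows what a wildcard term is taken to mean here.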

Page 12: ADLUG 2013 - A proposal for an RDF assembly line

MQL result (example)

Search metadata

Search result (first record)

Page 13: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Final storage Layer

Intermediate storage layer

Chaining all together

Q&A

Page 14: ADLUG 2013 - A proposal for an RDF assembly line

Final storage layer

Even if the final result could be directed to a local file, a more appropriate destination would be a triple store, which gives us the following benefits:

1) a standard query language (SPARQL) to query the store;

2) a standard format for exchanging data (RDF);

3) a storage where you are free to change your data in real time without any kind of reindexing operation.

[Figure: an RDF stream flowing into the TRIPLE STORE]

Page 15: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Information Retrieval

Triple store

Chaining all together

Q&A

Page 16: ADLUG 2013 - A proposal for an RDF assembly line

Introducing Apache Camel

Apache Camel ™ is a versatile open-source integration framework based on known Enterprise Integration Patterns.

Camel empowers you to define routing and mediation rules in a variety of domain-specific languages, including a Java-based Fluent API, Spring or Blueprint XML Configuration files, and a Scala DSL

Page 17: ADLUG 2013 - A proposal for an RDF assembly line

Introducing Apache Camel

In practice, Apache Camel lets us define this kind of thing:

Basically, the whole process is divided into atomic pieces (called “processors”), each of which is responsible for a small part of the overall work.

Each processor can act as a splitter or an aggregator, can do some content manipulation or, in general, can perform some task on the incoming message.
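The idea of small processors chained one after another can be sketched without Camel itself. The following plain-Java sketch is only an analogy: the Step interface, the AssemblyLine class and the three sample steps are hypothetical names, and neither Camel's own Processor interface nor its route DSL is used. Each step receives a list of messages and returns a list, so a step can split one message into many, aggregate many into one, or manipulate each message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java analogy of a processing route; this is not Apache Camel's API. */
final class AssemblyLine {

    /** Hypothetical step: takes the incoming messages, returns the outgoing ones. */
    interface Step {
        List<String> apply(List<String> messages);
    }

    /** Runs the steps in order, feeding each step's output to the next one. */
    static List<String> run(List<String> input, List<Step> steps) {
        List<String> current = input;
        for (Step step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    /** Splitter: emits one message per non-blank line of each incoming message. */
    static final Step SPLIT_LINES = messages -> {
        List<String> out = new ArrayList<>();
        for (String message : messages) {
            for (String line : message.split("\n")) {
                if (!line.isBlank()) {
                    out.add(line.trim());
                }
            }
        }
        return out;
    };

    /** Content manipulation: upper-cases every message. */
    static final Step UPPER_CASE = messages -> {
        List<String> out = new ArrayList<>();
        for (String message : messages) {
            out.add(message.toUpperCase(Locale.ROOT));
        }
        return out;
    };

    /** Aggregator: joins all messages into a single one. */
    static final Step JOIN = messages -> List.of(String.join(" | ", messages));

    public static void main(String[] args) {
        List<String> result = run(
                List.of("100 1 $a Collodi, Carlo.\n260 $a Firenze"),
                List.of(SPLIT_LINES, UPPER_CASE, JOIN));
        System.out.println(result);
    }
}
```

Because every step has the same shape, steps can be reordered, removed or added without touching the others, which is the property the assembly-line metaphor relies on.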

Page 18: ADLUG 2013 - A proposal for an RDF assembly line

Indexing processor

[Figure: MARC records flowing into the INDEX]

Once that is done, the storage contains the MARC records and provides MQL capabilities.

Note that this is a Near Real Time (NRT) engine, so records can be added on the fly and made immediately visible to searchers.

Last but not least, although we tagged this storage as “intermediate”, that doesn't mean it is transient: data is persisted and can be reused for further indexing processes.

Page 19: ADLUG 2013 - A proposal for an RDF assembly line

Split processor

[Figure: the record below is split into one fragment per field]

000 00694nam a2200241 i 4500
008 971205s1997 it j 000 0 ita c
020 $a 880921191X
082 1 $a 853.8
100 1 $a Collodi, Carlo.
245 13 $a Le avventure di Pinocchio / $c C. Collodi ; illustrazioni di A. Mussino
260 $a Firenze : $b Giunti, $c 1997.
440 0 $a Collana favolosa / [Giunti]
521 $a Letteratura per ragazzi
700 1 $a Mussino, Attilio.

Each “piece” will have a record fragment and a special correlation ID

001 00122721727
100 1 $a Collodi, Carlo

[Figure: further fragments of the same record (000, 008, 100, 260), each travelling as a separate piece]
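The split step described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the record arrives as text lines of the form "TAG content" (a real pipeline would read binary MARC), the 001 control field supplies the correlation ID, and RecordSplitter and Fragment are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter: turns one record into per-field fragments. */
final class RecordSplitter {

    /** One field of a record plus the ID of the record it came from. */
    static final class Fragment {
        final String correlationId; // value of the 001 control field
        final String tag;           // e.g. "100"
        final String content;       // indicators and subfields, as text

        Fragment(String correlationId, String tag, String content) {
            this.correlationId = correlationId;
            this.tag = tag;
            this.content = content;
        }

        @Override
        public String toString() {
            return "001 " + correlationId + " | " + tag + " " + content;
        }
    }

    /** Splits a record given as "TAG content" lines (an assumed input format). */
    static List<Fragment> split(List<String> lines) {
        // First pass: find the correlation ID.
        String id = null;
        for (String line : lines) {
            if (line.startsWith("001 ")) {
                id = line.substring(4).trim();
            }
        }
        if (id == null) {
            throw new IllegalArgumentException("Record has no 001 field");
        }
        // Second pass: one fragment per remaining field.
        List<Fragment> fragments = new ArrayList<>();
        for (String line : lines) {
            if (line.length() < 4 || line.startsWith("001 ")) {
                continue; // skip malformed lines and the ID field itself
            }
            fragments.add(new Fragment(id, line.substring(0, 3), line.substring(4).trim()));
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> record = List.of(
                "001 00122721727",
                "100 1 $a Collodi, Carlo.",
                "260 $a Firenze :$b Giunti,$c 1997.");
        for (Fragment fragment : split(record)) {
            System.out.println(fragment);
        }
    }
}
```

Carrying the ID inside every fragment is what lets later processors work on a single tag in isolation and still know which record the resulting triples belong to.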

Page 20: ADLUG 2013 - A proposal for an RDF assembly line

Transformation processor (1/2)

001 27283
020 1 $a880921191X

A processor that operates on a fragment (i.e. a tag) of a MARC record to produce an RDF triple.

Obviously this is a family of processors, because different logic can be applied depending on the incoming tag. In the simplest case a tag can be translated directly into a triple.

<atcult:27283> <bibo:isbn> "880921191X"

At other times a single tag generates a separate (dependent) entity plus a reference to it within the main entity. If the dependent entity has already been created by another similar tag, the processor creates just the reference.

001 27283
100 1 $aCollodi Carlo.

<atcult:029100> <foaf:name> "Collodi Carlo"

<atcult:27283> <dc:creator> <atcult:029100>
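The two cases on this slide can be sketched in Java as follows. This is an illustration under assumptions, not the presented implementation: TagTransformers is a hypothetical class, triples are plain strings, only subfield $a is read, and the identifier of the dependent entity comes from a running counter because the slides do not say how identifiers such as 029100 are assigned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tag-to-triple transformers; triples are plain strings for brevity. */
final class TagTransformers {
    // Dependent entities (authors) already created: name -> entity ID.
    private final Map<String, String> authorIds = new HashMap<>();

    /** Returns the text of subfield $a, or null if the fragment has none. */
    static String subfieldA(String content) {
        int start = content.indexOf("$a");
        if (start < 0) {
            return null;
        }
        int end = content.indexOf('$', start + 2);
        String value = end < 0
                ? content.substring(start + 2)
                : content.substring(start + 2, end);
        return value.trim();
    }

    /** Simplest case: the ISBN tag maps directly to a single triple. */
    List<String> isbn(String recordId, String content) {
        List<String> triples = new ArrayList<>();
        String isbn = subfieldA(content);
        if (isbn != null) {
            triples.add("<atcult:" + recordId + "> <bibo:isbn> \"" + isbn + "\"");
        }
        return triples;
    }

    /** Author tag: creates the dependent entity once, afterwards only references it. */
    List<String> author(String recordId, String content) {
        List<String> triples = new ArrayList<>();
        String name = subfieldA(content);
        if (name == null) {
            return triples;
        }
        String authorId = authorIds.get(name);
        if (authorId == null) {
            // Assumed ID scheme: a running counter (not taken from the slides).
            authorId = "author-" + (authorIds.size() + 1);
            authorIds.put(name, authorId);
            triples.add("<atcult:" + authorId + "> <foaf:name> \"" + name + "\"");
        }
        triples.add("<atcult:" + recordId + "> <dc:creator> <atcult:" + authorId + ">");
        return triples;
    }

    public static void main(String[] args) {
        TagTransformers transformers = new TagTransformers();
        System.out.println(transformers.isbn("27283", "1 $a880921191X"));
        System.out.println(transformers.author("27283", "1 $aCollodi Carlo"));
        System.out.println(transformers.author("99999", "1 $aCollodi Carlo"));
    }
}
```

The in-memory map is enough for a single-threaded sketch; if fragments are processed in parallel, the check for an existing entity would need a shared, thread-safe lookup.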

Page 21: ADLUG 2013 - A proposal for an RDF assembly line

Transformation processor (2/2)

001 27283
773 1 $aThe Hug

We could have a transformer that, in order to produce its results (one or more triples), needs to find information about another record. Think, for example, of the 77X relation tags.

How do we get that information? In the incoming message we have just a correlation ID (the 001) and a tag.

This is where the intermediate storage with its MQL capabilities comes in.

<atcult:27283> <frbr:Work> <atcult:92827>

<atcult:27283> <bibo:DocumentPart> <atcult:92827>

[Figure: the transformer sends the MQL query 245a=The Hug to the intermediate storage and gets back MARC XML]
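A transformer of this kind can be sketched as below. The RecordLookup interface is a hypothetical stand-in for the intermediate storage: it takes an MQL query and returns the 001 of the first matching record. No real OseeGenius client API is implied, and the predicate is passed in as a parameter because the choice of vocabulary is outside the scope of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical transformer for relation tags that must consult another record. */
final class RelationTransformer {

    /** Stand-in for the intermediate storage: runs an MQL query, returns the first matching 001. */
    interface RecordLookup {
        Optional<String> findFirstId(String mqlQuery);
    }

    private final RecordLookup lookup;
    private final String predicate;

    RelationTransformer(RecordLookup lookup, String predicate) {
        this.lookup = lookup;
        this.predicate = predicate;
    }

    /** Builds the MQL query for the related title, e.g. "245a=The Hug". */
    static String queryFor(String content) {
        int start = content.indexOf("$a");
        if (start < 0) {
            throw new IllegalArgumentException("No $a subfield: " + content);
        }
        int end = content.indexOf('$', start + 2);
        String title = end < 0
                ? content.substring(start + 2)
                : content.substring(start + 2, end);
        return "245a=" + title.trim();
    }

    /** Emits one triple linking the two records, or nothing if no related record is found. */
    List<String> transform(String recordId, String content) {
        List<String> triples = new ArrayList<>();
        lookup.findFirstId(queryFor(content)).ifPresent(relatedId ->
                triples.add("<atcult:" + recordId + "> <" + predicate
                        + "> <atcult:" + relatedId + ">"));
        return triples;
    }

    public static void main(String[] args) {
        // In-memory lookup used only for this demonstration.
        RecordLookup lookup = query ->
                query.equals("245a=The Hug") ? Optional.of("92827") : Optional.empty();
        RelationTransformer transformer = new RelationTransformer(lookup, "frbr:Work");
        System.out.println(transformer.transform("27283", "1 $aThe Hug"));
    }
}
```

Keeping the lookup behind an interface means the transformer can be tested with an in-memory stub, as in main, and wired to the real storage only when the whole line is assembled.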

Page 22: ADLUG 2013 - A proposal for an RDF assembly line

The big picture

[Figure: a LOAD BALANCER in front of four SOLR instances (SOLR1, SOLR2, SOLR3, SOLR4); the index is split horizontally into parts numbered 1 to 4, which are spread across the instances]

Page 23: ADLUG 2013 - A proposal for an RDF assembly line

An example (1/5)

Page 24: ADLUG 2013 - A proposal for an RDF assembly line

An example (2/5)

Page 25: ADLUG 2013 - A proposal for an RDF assembly line

An example (3/5)

Page 26: ADLUG 2013 - A proposal for an RDF assembly line

An example (4/5)

Page 27: ADLUG 2013 - A proposal for an RDF assembly line

An example (5/5)

Page 28: ADLUG 2013 - A proposal for an RDF assembly line

Agenda

Goals

Information Retrieval

Triple store

Chaining all together

Q&A

Page 29: ADLUG 2013 - A proposal for an RDF assembly line

A proposal for an RDF assembly line

32nd ADLUG ANNUAL MEETING 2013

ARTIUM and the Fundación Sancho El Sabio – Vitoria-Gasteiz

16th – 18th October 2013

Thank You!