Subject Repositories: European collaboration in the international context
28-29 January 2010
Workshop: Technical infrastructure & interoperability
Benoit Pauwels, Université Libre de Bruxelles, Belgium

Workshop plan
• Theme 1: The Economists Online network of data providers
  – General infrastructure of the EO solution
  – DIDL/MODS: the EO metadata exchange format
  – RDF/XML Admin file: decentralized administration
  – Enrichment of metadata
• Theme 2: Economists Online and RePEc
  – Pulling metadata from RePEc
  – Pushing metadata to RePEc
  – Contribute to LogEc
  – Use CitEc

Workshop plan
• Each theme (45’)
  – Introduction (BP, 20’)
  – 3 topics for brainstorming (breakout groups, 10’)
  – Breakout groups reporting back (all, 15’)

Theme 1: The Economists Online network of data providers
• General infrastructure of the EO solution
• DIDL/MODS: the EO metadata exchange format
• RDF/XML Admin file: decentralized administration
• Enrichment of metadata

[Diagram, shown twice: general infrastructure of the EO solution. Labels: Meresco, Metadata, Harvester, Objects, HTTP, Crawler, Metadata, Lucene, EO portal (homemade, FOSS), Exporter engine (homemade, FOSS), Logs, OAI-PMH, RSS, SRU, Other portals, RePEc.]
[Annotations on the second copy: Metadata exchange format: DIDL/MODS (NEEO specs). Usage metadata exchange format: SWUP (OFI Comm Profile).]

Technical decisions
Desired EO functionality | Technical decision
Faceted search & find experience | Normalized/normalizable metadata
APA formatted citations | Granular metadata
Publication list per author | Unambiguous identification of authors
Full text indexing/searching | Unambiguous links to full texts
Enrichment of metadata (JEL, datasets, citations, ReDIF) | Extensible metadata format

Metadata exchange format
• XML container structure that can hold semantically distinct metadata: DIDL
  – descriptive metadata
  – object files (by-ref)
  – splash page
  – enriched metadata
    o JEL
    o full text (by-ref)
    o datasets (by-ref)
    o [ references ]
    o RePEc handle and metadata (by-ref)
• Based on existing container structure defined by SurfShare
• “info:eu-repo” vocabularies (objectfile accessRights, version, ...)

Metadata exchange format
• Granular descriptive metadata: MODS (3.2)
  – Based on existing metadata structure defined by SurfShare
  – “info:eu-repo” vocabularies (publication type, ...)
• Unambiguous identification of authors: DAI (Digital Author Identifier)
  – National or institution-unique persistent identifier
• Solutions not specific to the NEEO project; continuous aim of standardization at a level that surpasses the project

EO Data model
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
      Descriptor/type (« descriptiveMetadata »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by value (XML)
    Item[0..∞] (of type objectFile)
      Descriptor/type (« objectFile »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by ref. (URL)
    Item[0..1] (of type humanStartPage)
      Descriptor/type (« humanStartPage »)
      Component/Resource -- representation by ref. (URL)
• Publication is described as a complex (compound) object
  – persistent identifier
• Aggregation of 3 types of components
  – descriptiveMetadata (MODS)
  – objectFiles
  – humanStartPage
• Extensible
  – additional items can be stored within the complex object
• MODS
  – contains Digital Author Identifier (DAI) of EO author
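
The cardinalities in the data model (at least one descriptiveMetadata item, any number of objectFile items, at most one humanStartPage) can be checked mechanically. The following is a minimal sketch, not the NEEO implementation: the class names `Item` and `Publication`, the function `check_cardinality` and the example identifiers and URLs are all hypothetical, and the real exchange format is DIDL XML rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Item:
    item_type: str                    # e.g. "descriptiveMetadata", "objectFile", "humanStartPage"
    identifier: Optional[str] = None  # persistent identifier, where the model has one
    modified: Optional[str] = None    # modification date, where the model has one
    resource: str = ""                # inline XML (by value) or a URL (by reference)

@dataclass
class Publication:
    identifier: str                   # persistent identifier of the compound object
    modified: str
    items: List[Item] = field(default_factory=list)

# (minimum, maximum) occurrences per item type; None means unbounded.
CARDINALITY = {
    "descriptiveMetadata": (1, None),
    "objectFile": (0, None),
    "humanStartPage": (0, 1),
}

def check_cardinality(pub: Publication) -> List[str]:
    """Return a list of cardinality problems; an empty list means the object conforms."""
    problems = []
    for item_type, (low, high) in CARDINALITY.items():
        count = sum(1 for item in pub.items if item.item_type == item_type)
        if count < low:
            problems.append(f"{item_type}: expected at least {low}, found {count}")
        if high is not None and count > high:
            problems.append(f"{item_type}: expected at most {high}, found {count}")
    # Item types not listed in CARDINALITY are accepted: the model is extensible.
    return problems

example = Publication(
    identifier="hypothetical-persistent-id",
    modified="2010-01-28",
    items=[
        Item("descriptiveMetadata", resource="<mods/>"),
        Item("objectFile", resource="http://repository.example.org/file.pdf"),
        Item("humanStartPage", resource="http://repository.example.org/record"),
    ],
)
print(check_cardinality(example))
```

Unknown item types pass the check on purpose, which mirrors the "Extensible" bullet: additional items can be stored in the compound object without breaking older consumers.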

Metadata exchange format
• Implementations in NEEO
  – DIDL application profile
  – MODS application profile
  – Vocabularies in DIDL and MODS
  – Technical guidelines for project partners
• Solutions: home-made or with external support
  – ARNO: home-made
  – DSpace: home-made, AtMire
  – EPrints: home-made, ECS-University of Southampton
  – Fedora: METS/MODS -> DIDL/MODS
  – DigiTool: METS/MARC -> DIDL/MODS

Decentralized registry service
• XML-RDF file
  – FOAF + NEEO-specific vocabulary
  – maintained by each data provider on a local web server
  – information about the institution: name, description, ...
  – OAI baseURL + OAI sets to harvest
  – EO authors: photograph, full name, affiliation, DAI
• Fetched over HTTP GET and validated by the EO Gateway at regular intervals
  – Automated harvesting process
  – Made visible through portal
• New partner
  – Create admin file
  – Ask for registration at [email protected], declaring the location of the admin file and having it validated
  – If valid, you’re in
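
To make the validation step concrete, here is a minimal sketch of how a gateway could check such an admin file. It rests on assumptions: the `neeo` namespace URI, the element names `neeo:oaiBaseURL`, `neeo:oaiSet` and `neeo:dai`, the nesting of FOAF elements and the sample values are all hypothetical, since the slides do not show the NEEO-specific vocabulary. Only the RDF and FOAF namespace URIs are the published ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "neeo": "http://example.org/neeo#",  # hypothetical namespace, for illustration only
}

# Hypothetical admin file; the real NEEO layout may differ.
ADMIN_FILE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:neeo="http://example.org/neeo#">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
    <foaf:member>
      <foaf:Person>
        <foaf:name>Jane Doe</foaf:name>
        <neeo:dai>hypothetical-dai-0001</neeo:dai>
      </foaf:Person>
    </foaf:member>
  </foaf:Organization>
</rdf:RDF>"""

def validate_admin_file(xml_text):
    """Return a list of problems found in the admin file; empty means valid."""
    root = ET.fromstring(xml_text)
    org = root.find("foaf:Organization", NS)
    if org is None:
        return ["no institution description found"]
    errors = []
    if not org.findtext("foaf:name", default="", namespaces=NS).strip():
        errors.append("institution name missing")
    if not org.findtext("neeo:oaiBaseURL", default="", namespaces=NS).strip():
        errors.append("OAI baseURL missing")
    if not org.findall("neeo:oaiSet", NS):
        errors.append("no OAI set to harvest")
    for person in org.findall("foaf:member/foaf:Person", NS):
        name = person.findtext("foaf:name", default="(unnamed)", namespaces=NS)
        if not person.findtext("neeo:dai", default="", namespaces=NS).strip():
            errors.append(f"author without DAI: {name}")
    return errors

print(validate_admin_file(ADMIN_FILE))
```

Returning a list of problems rather than raising on the first one lets a new partner fix everything in one pass before asking for registration.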

[Diagram, shown twice: the general infrastructure diagram (Meresco, Metadata, Harvester, Objects, HTTP, Crawler, Metadata, Lucene, EO portal (homemade, FOSS), Exporter engine (homemade, FOSS), Logs, OAI-PMH, RSS, SRU, Other portals, RePEc). Labels added in the second copy: RSS/Atom, SRU, Enrichment service, OAI-PMH.]

Metadata enrichment
• “Automated” enrichment: JEL, full text
  1. ES (enrichment service) gets records to be enriched from EO, over SRU
     1. Based on date of request for enrichment of certain type and version
     2. Based on flag set in EO record
  2. ES creates enrichment record(s)
  3. ES makes enrichment records available to EO, over OAI-PMH
  4. EO harvests enrichment records from ES and integrates them into the original record
  5. EO reuses enrichment information in its services: index & present
• “Manual” enrichment: datasets
  1) Partner enters permalink of publication on DVN platform
  2) EO PMH-harvests DDI from DVN, and stores by-ref information
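
Steps 1 and 4 of the automated flow are plain HTTP requests, so they can be sketched as URL builders. This is only an illustration under assumptions: the base URLs, the CQL index name `enrichment.requested` and the metadata prefix `didl` are hypothetical. The parameter names themselves (`operation`, `version`, `query`, `startRecord`, `maximumRecords` for SRU; `verb`, `metadataPrefix`, `from`, `set` for OAI-PMH) come from the respective protocols.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sru_search_url(base_url, cql_query, max_records=50, start_record=1):
    """Build an SRU searchRetrieve request (step 1: ES asks EO for records to enrich)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"

def oai_list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request (step 4: EO harvests enrichment records from ES)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoints and query, for illustration only.
sru_url = sru_search_url("http://eo.example.org/sru", 'enrichment.requested = "jel"', max_records=10)
oai_url = oai_list_records_url("http://es.example.org/oai", "didl", from_date="2010-01-01")
print(sru_url)
print(oai_url)
```

Passing `from_date` on each harvest keeps step 4 incremental: only enrichment records changed since the previous run are transferred.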

Enriched publication
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
    Item[0..∞] (of type objectFile)
    Item[0..1] (of type humanStartPage)
    Item[0..∞] (of type text)
    Item[0..∞] (of type enrichedMetadata)
    Item[0..∞] (of type dataset)
    Item[0..∞] (of type review)
[Diagram labels: EO; IR / ES; HTML; TXT; Dataset DDI; Review, shown with its own Descriptor/Identifier (persistent identifier), Descriptor/modified, Item[1..∞] (of type descriptiveMetadata) and Item[0..∞] (of type objectFile).]
• LinkedData / SemanticWeb / ORE ready

Theme 1: The Economists Online network of data providers
» BO Group 1: DIDL/MODS
  » Scalable? Implementation by 100s of partners
  » Local experiences from existing partners: implementation issues you want to share?
  » Can this become a standard for exchange of metadata of IR-contained publications? Where does this stand next to (flavours of) DC, SWAP, ...?
» BO Group 2: XML Admin file
  » Scalable? Implementation by 100s of partners
  » Local experiences from existing partners: implementation issues you want to share?
  » DAI?
» BO Group 3: Enrichment model
  » Extensibility: vocabulary for semantics of components
  » Manual enrichment: need for enriched submission form, making it easy for people to make enriched publications
  » Automated (JEL, full text): sustainable?

Workshop plan
• Theme 2: Economists Online and RePEc
  – Pulling metadata from RePEc
  – Pushing metadata to RePEc
  – Contribute to LogEc
  – Use CitEc

RePEc model
• RePEc archives contain RePEc series, which contain working papers, articles, books, book chapters, software
  – Manually maintained by research centres, journal publishers, university departments all over the world
  – +/- 900 archives, more than 4000 series
  – ReDIF metadata format
• Network accessible over FTP or HTTP
• Aggregation by RePEc services:
  – EconPapers
  – IDEAS
  – Central PMH-accessible aggregated archive of AMF formatted metadata

RePEc model
Template-type: ReDIF-Paper 1.0
Author-Name: Capron, Henri
Author-Email: [email protected]
Author-Name: Meeusen, Wim
Author-Email: [email protected]
Author-Name: Dumont, Michel
Author-Person: pdu51
Author-Name: Cincera, Michele
Author-Person: pci5
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
Publication-Status: Published as a report for the Belgian Federal Office for Scientific, Technical and Cultural Affairs (OSTC)
File-URL: http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/941/1/mc-0048.pdf
File-Format: application/pdf
Handle: RePEc:dul:ecoulb:2013-941
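
A ReDIF template like the one above is a sequence of "Field: value" lines in which fields such as Author-Name may repeat. The sketch below shows one way to read it; it is not RePEc's own parser. The function names are hypothetical, and two simplifications are assumed: indented lines continue the previous value, and every Author-* field belongs to the most recent Author-Name.

```python
def parse_redif(text):
    """Parse a ReDIF template into an ordered list of (field, value) pairs."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and fields:
            # Assumption: an indented line continues the previous field's value.
            key, value = fields[-1]
            fields[-1] = (key, f"{value} {line.strip()}")
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"not a ReDIF field line: {line!r}")
        fields.append((key.strip(), value.strip()))
    return fields

def group_authors(fields):
    """Collect each Author-Name with the Author-* fields that follow it."""
    authors = []
    for key, value in fields:
        if key == "Author-Name":
            authors.append({key: value})
        elif key.startswith("Author-") and authors:
            authors[-1][key] = value
    return authors

SAMPLE = """Template-type: ReDIF-Paper 1.0
Author-Name: Dumont, Michel
Author-Person: pdu51
Author-Name: Cincera, Michele
Author-Person: pci5
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
Handle: RePEc:dul:ecoulb:2013-941
"""

record = parse_redif(SAMPLE)
print(group_authors(record))
print(dict(record)["Handle"])
```

Splitting on the first colon only matters here: both the title and the handle contain further colons that belong to the value.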

RePEc model compared to IR model
• Very similar, BUT
• RePEc model:
  – Harvests only from “official” publisher repositories
  – Therefore: 1 work exists once in RePEc and it is guaranteed to be the one and only “official” manifestation of the work
• IR model:
  – holds publications for which the institution is typically not the publisher
  – 1 work = 1 official manifestation + multiple author manifestations
  – one work can exist in:
    o one or more repositories
    o as different publication types
    o with different descriptive metadata
    o with different object files attached
    o with different object file metadata
• Pushing and pulling metadata records from RePEc and IRs into one system is bound to raise problems

Pull metadata from RePEc
• EO harvests AMF formatted metadata records from http://oai.repec.openlib.org/
• Overlap!
  – The same records are harvested from IR and RePEc
  – Solution:
    o XML Admin file contains directive <not-from-repec-series>
    o Permits specifying which RePEc series do not need to be harvested from RePEc, since they are already delivered through the IR
  – BUT:
    o IR contains articles produced by its authors
    o These articles are contained in a journal RePEc series
    o Overlap in EO cannot be avoided
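
The effect of the <not-from-repec-series> directive can be sketched as a filter over harvested records. This is an illustration, not the EO gateway code: the function names and the record layout are hypothetical, and it assumes the directive lists series as "RePEc:archive:series", the leading part of a handle such as RePEc:dul:ecoulb:2013-941.

```python
def repec_series(handle):
    """Return the series part of a RePEc handle of the form RePEc:archive:series:item."""
    parts = handle.split(":", 3)
    if len(parts) != 4 or parts[0] != "RePEc":
        raise ValueError(f"unexpected RePEc handle: {handle!r}")
    return f"RePEc:{parts[1]}:{parts[2]}"

def drop_overlapping(records, not_from_repec_series):
    """Keep only records whose series is not already delivered through an IR."""
    excluded = set(not_from_repec_series)
    return [r for r in records if repec_series(r["handle"]) not in excluded]

harvested = [
    {"handle": "RePEc:dul:ecoulb:2013-941"},
    {"handle": "RePEc:aah:aarhec:1987-21"},
]
kept = drop_overlapping(harvested, ["RePEc:dul:ecoulb"])
print([r["handle"] for r in kept])
```

A series-level filter cannot catch the remaining case named on the slide: an article that sits in the IR and also in a journal's RePEc series belongs to a series the institution does not control, so it is not excluded.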

Push metadata to RePEc
• EO sets up “RePEc:ner” archive, containing ReDIF-X formatted records
• ReDIF-X
  – All records are delivered as “ReDIF-Paper”, but with extra fields denoting the “real” publication status and version of text
• Overlap!
  – Most institutions already maintain RePEc series: these records must not be pushed by EO
  – XML Admin file controls which series to feed into this “ner” archive
    o <feed-repec>: boolean, to feed or not to feed
    o <feed-repec-series>: if not given, all records with full text that are not working papers are mapped to one series for that institution
    o RePEc series: OAI setspec of DIDL/MODS record
• BUT: the IR-inherent problem of multiple copies/versions is pushed to RePEc

Push metadata to RePEc: ReDIF-X
Template-type: ReDIF-Paper 1.0
Title: Block investments and the race for corporate control in Belgium
Author-Name: Chapelle, Ariane
Language: en
Note: info:eu-repo/semantics/published
X-PublishedAs-Type: article
X-PublishedAs-Article-Year: 2004
X-PublishedAs-Article-Journal: Corporate Ownership & Control
X-PublishedAs-Article-Volume: 2
X-PublishedAs-Article-Issue: 1
Order-URL: http://dipot.ulb.ac.be:8080/dspace/handle/2013/9943
File-URL: http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf
File-Format: application/pdf
File-Version: authorVersion
Handle: RePEc:ulb:ecoulb:2013/9943
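
Producing such a record from repository metadata is a straightforward serialization. The sketch below covers only the article case shown above and emits a subset of the fields. The function name `to_redif_x` and the input dictionary layout are hypothetical; the field names are copied from the example, and nothing is assumed about how other publication types are encoded.

```python
def to_redif_x(record):
    """Serialize a publication dict as a ReDIF-Paper template with ReDIF-X extension fields."""
    lines = ["Template-type: ReDIF-Paper 1.0", f"Title: {record['title']}"]
    for author in record["authors"]:
        lines.append(f"Author-Name: {author}")
    published = record.get("published_as")
    if published:
        if published["type"] != "article":
            # Only the article case appears on the slide; other types are not guessed at.
            raise ValueError("this sketch only handles articles")
        lines.append("X-PublishedAs-Type: article")
        for label in ("Year", "Journal", "Volume", "Issue"):
            value = published.get(label.lower())
            if value is not None:
                lines.append(f"X-PublishedAs-Article-{label}: {value}")
    lines.append(f"File-URL: {record['file_url']}")
    lines.append(f"File-Format: {record['file_format']}")
    lines.append(f"File-Version: {record['file_version']}")
    lines.append(f"Handle: {record['handle']}")
    return "\n".join(lines) + "\n"

publication = {
    "title": "Block investments and the race for corporate control in Belgium",
    "authors": ["Chapelle, Ariane"],
    "published_as": {
        "type": "article",
        "year": 2004,
        "journal": "Corporate Ownership & Control",
        "volume": 2,
        "issue": 1,
    },
    "file_url": "http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf",
    "file_format": "application/pdf",
    "file_version": "authorVersion",
    "handle": "RePEc:ulb:ecoulb:2013/9943",
}
redif_text = to_redif_x(publication)
print(redif_text)
```

The X- prefixed fields let consumers that only understand plain ReDIF-Paper ignore the extension, while EO-aware services can recover the real publication type and text version.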

LogEc
• Aim: track abstract views and download clicks of publications presented through RePEc services (EconPapers, IDEAS, ... Economists Online)
• NOT: tracking of usage at the level of the archives
  – Downloads of publications contained in RePEc archives that are initiated via Google do not show up in LogEc
• How:
  – EO logs abstract views and download clicks of object files
  – On a monthly basis, EO transforms these log entries into the requested LogEc format, using “rstat.pl”
    2009-10 EconomistsOnline RePEc:aah:aarhec:1987-21 a: 65.55.207.69 66.235.124.10 d: 66.235.124.10
• RePEc handle of publication is necessary
  – EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record
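
The monthly transformation can be pictured as grouping click events per RePEc handle. The sketch below only mirrors the layout of the sample line above (month, service name, handle, then "a:" and "d:" followed by IP addresses); it is not "rstat.pl", and the function name and event tuple layout are hypothetical.

```python
from collections import defaultdict

def logec_lines(month, service, events):
    """Group (handle, kind, ip) events into one line per handle.

    kind is "a" for an abstract view or "d" for a download click.
    """
    grouped = defaultdict(lambda: {"a": [], "d": []})
    for handle, kind, ip in events:
        if kind not in ("a", "d"):
            raise ValueError(f"unknown event kind: {kind!r}")
        grouped[handle][kind].append(ip)
    lines = []
    for handle in sorted(grouped):
        parts = [month, service, handle]
        for kind in ("a", "d"):
            ips = grouped[handle][kind]
            if ips:  # omit the marker when there were no events of this kind
                parts.append(f"{kind}:")
                parts.extend(ips)
        lines.append(" ".join(parts))
    return lines

events = [
    ("RePEc:aah:aarhec:1987-21", "a", "65.55.207.69"),
    ("RePEc:aah:aarhec:1987-21", "a", "66.235.124.10"),
    ("RePEc:aah:aarhec:1987-21", "d", "66.235.124.10"),
]
for line in logec_lines("2009-10", "EconomistsOnline", events):
    print(line)
```

Because lines are keyed on the RePEc handle, a publication without a handle cannot be reported at all, which is why the handle has to travel with the DIDL/MODS record.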

LogEc
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
    Item[0..∞] (of type objectFile)
    Item[0..1] (of type humanStartPage)
    Item[0..∞] (of type descriptiveMetadata): RePEc handle, Descriptor/modified, byRef to RePEc (AMF metadata)
[Diagram labels: EO; RePEc]

CitEc
• Aim: citation analysis for RePEc publications
• How:
  – Analyze text: extract and parse list of references from publications
  – References are checked for availability in RePEc
  – Cites:
    o references to other RePEc publications
    o textual references
  – CitedBy
  – Co-citations
• EO publications (from our IRs) are pushed to RePEc and are therefore pulled through the CitEc processing
• EO has access to the resulting CitEc data, and presents this through the EO portal (not yet, will be in Feb 2010)
• RePEc handle of publication is necessary
  – EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record
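
The relation between Cites, CitedBy and co-citations can be shown with a small sketch. It says nothing about how CitEc itself is implemented; it only assumes that reference extraction has already produced, for each citing handle, the list of RePEc handles it cites. The handles and function names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cited_by(cites):
    """Invert a 'cites' mapping: for each cited handle, the set of handles citing it."""
    inverse = defaultdict(set)
    for citing, references in cites.items():
        for reference in references:
            inverse[reference].add(citing)
    return dict(inverse)

def co_citations(cites):
    """Count, for each pair of handles, how many publications cite both."""
    counts = defaultdict(int)
    for references in cites.values():
        for pair in combinations(sorted(set(references)), 2):
            counts[pair] += 1
    return dict(counts)

# Hypothetical handles, for illustration only.
cites = {
    "RePEc:xxx:wpaper:1": ["RePEc:xxx:wpaper:2", "RePEc:xxx:wpaper:3"],
    "RePEc:xxx:wpaper:4": ["RePEc:xxx:wpaper:2", "RePEc:xxx:wpaper:3"],
    "RePEc:xxx:wpaper:5": ["RePEc:xxx:wpaper:2"],
}
print(cited_by(cites)["RePEc:xxx:wpaper:2"])
print(co_citations(cites))
```

Both structures are keyed on RePEc handles, which is the practical reason the handle must be present before a record can take part in citation analysis.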

Theme 2: Economists Online and RePEc
» BO Group 1: Push/pull to/from RePEc
  » ReDIF-X data structure
  » Duplicates; different versions of identical publication
» BO Group 2: Publishing models
  » Advantages/disadvantages of RePEc publishing model as opposed to IR publishing model
  » Push the two models together? Do we need to foresee specific services in the gateway or portal to make these two live together in peace?
» BO Group 3: Future RePEc/EO services
  » What services should EO and RePEc jointly be looking at in the future in the interest of the economics researcher?