NEEO project
EC Final review meeting
Gateway and portal
23 March 2010
Benoit Pauwels
Université Libre de Bruxelles, Belgium
Plan
• Overview of technical infrastructure
• EO as a network of data providers – descriptive metadata
• EO as a network of data providers – usage statistics
• Added value services
  • Publication lists
  • Enriched metadata
  • Full-text searching
  • Multilinguality
• Collaboration with RePEc
• EO gateway and portal
[Diagram: overview of technical infrastructure. Labels: Meresco (metadata harvester, objects crawler over HTTP, metadata, Lucene); EO portal (homemade, FOSS); exporter engine (homemade, FOSS); logs; other portals; RePEc; enrichment service. Interfaces: OAI-PMH, RSS/Atom, SRU. Formats: DIDL / MODS, SWUP.]
Descriptive metadata exchange format
Desired EO functionality                          | Technical decision
Facetted search & find experience                 | Normalized/normalizable metadata
APA formatted citations                           | Granular metadata
Publication list per EO author                    | Unambiguous identification of authors
Full text indexing/searching                      | Unambiguous links to full texts
Enrichment of metadata (JEL, datasets, citations) | Extensible metadata format
Descriptive metadata exchange format
• DIDL – XML container structure that can hold semantically distinct metadata
  • Descriptive, object files (by-ref), splash page, enriched metadata
  • Based on existing container structure defined by SurfShare
• MODS (3.2) – granular descriptive metadata
  • Based on existing metadata structure defined by SurfShare
• DAI – Unambiguous identification of authors
  • National or institution-unique persistent identifier
• Continuous aim of standardization at a level that surpasses the NEEO project
  • NEEO adaptations fed back to SurfShare
EO descriptive metadata model

DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
      Descriptor/type (« descriptiveMetadata »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by value (XML)
    Item[0..∞] (of type objectFile)
      Descriptor/type (« objectFile »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by ref. (URL)
    Item[0..1] (of type humanStartPage)
      Descriptor/type (« humanStartPage »)
      Component/Resource -- representation by ref. (URL)

• Publication is described as a complex (compound) object
  – persistent identifier
• Aggregation of 3 types of components
  – descriptiveMetadata (MODS)
  – objectFiles
  – humanStartPage
• Extensible
  – additional items can be stored within the complex object
• MODS contains DAI of EO author
• Semantic Web - Linked Data – OAI-ORE ready
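
The cardinalities of the compound object (at least one descriptiveMetadata item, any number of objectFile items, at most one humanStartPage) can be checked mechanically. The following is a minimal sketch, not the project's validator: the class names, field names and the `validate` function are hypothetical, and the real exchange format is DIDL XML rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubItem:
    """One typed Item nested inside the top-level DIDL Item (hypothetical model)."""
    item_type: str                    # e.g. "descriptiveMetadata", "objectFile", "humanStartPage"
    resource: str                     # inline XML (by value) or a URL (by reference)
    identifier: Optional[str] = None  # persistent identifier, where the item carries one
    modified: Optional[str] = None    # modification timestamp, where the item carries one


@dataclass
class CompoundObject:
    """A publication: the top-level Item plus its typed sub-items."""
    identifier: str
    modified: str
    items: List[SubItem] = field(default_factory=list)


# Cardinalities as listed on the slide: (minimum, maximum); None means unbounded.
CARDINALITY = {
    "descriptiveMetadata": (1, None),
    "objectFile": (0, None),
    "humanStartPage": (0, 1),
}


def validate(obj: CompoundObject) -> List[str]:
    """Return cardinality violations; an empty list means the object is valid.

    Item types outside CARDINALITY are ignored, since the model is extensible.
    """
    problems = []
    for item_type, (low, high) in CARDINALITY.items():
        count = sum(1 for item in obj.items if item.item_type == item_type)
        if count < low:
            problems.append(f"{item_type}: expected at least {low}, found {count}")
        if high is not None and count > high:
            problems.append(f"{item_type}: expected at most {high}, found {count}")
    return problems
```

A publication with one MODS record and one PDF link passes; an object with no descriptiveMetadata item is reported as invalid.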
Descriptive metadata exchange format
• Central EO gateway
  • DIDL and MODS application profiles
  • Vocabularies in DIDL and MODS
  • Technical guidelines for project partners
  • All documentation is available in open access
• Partner solutions: home-made or with external support
  • ARNO: home-made
  • DSpace: home-made, AtMire
  • EPrints: home-made, ECS-University of Southampton
  • Fedora: METS/MODS -> DIDL/MODS
  • DigiTool: METS/MARC -> DIDL/MODS
• All original partners + 2 new partners
Decentralized registry service
• Aim: sustainable solution for big network with many partners
• Decentralized Admin file
  • Format: RDF/XML (FOAF + NEEO-specific vocabulary)
  • Decentralized file sits on local web server of project partner
  • Content
    - information on institution: name, description, ...
    - OAI baseURL + OAI sets to harvest
    - EO authors: DAI, photograph, full name, affiliation
  • EO gateway fetches the file over HTTP (GET) and validates it at regular intervals
  • Used for
    - information in EO portal screens
    - publication lists (match on DAI)
    - automated harvesting process
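
To illustrate how a gateway might read such an Admin file, here is a minimal sketch using only the Python standard library. The FOAF and RDF namespaces are the real ones; the `neeo:` namespace URI, the element names `oaiBaseURL`, `oaiSet` and `dai`, and the sample content are assumptions made for illustration, not the project's actual vocabulary.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
NEEO = "http://example.org/neeo#"  # hypothetical namespace, for illustration only

# Invented sample; a real Admin file sits on the partner's own web server.
SAMPLE_ADMIN_FILE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:neeo="{NEEO}">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
    <foaf:member>
      <foaf:Person>
        <foaf:name>Jane Doe</foaf:name>
        <neeo:dai>dai-0001</neeo:dai>
      </foaf:Person>
    </foaf:member>
  </foaf:Organization>
</rdf:RDF>
"""


def parse_admin_file(xml_text):
    """Extract institution, harvesting and author information from an Admin file."""
    root = ET.fromstring(xml_text)
    org = root.find(f"{{{FOAF}}}Organization")
    if org is None:
        raise ValueError("no foaf:Organization element found")
    authors = [
        {
            "name": person.findtext(f"{{{FOAF}}}name"),
            "dai": person.findtext(f"{{{NEEO}}}dai"),
        }
        for person in org.iter(f"{{{FOAF}}}Person")
    ]
    return {
        "institution": org.findtext(f"{{{FOAF}}}name"),
        "oai_base_url": org.findtext(f"{{{NEEO}}}oaiBaseURL"),
        "oai_sets": [s.text for s in org.findall(f"{{{NEEO}}}oaiSet")],
        "authors": authors,
    }
```

The returned dictionary holds what the slide lists as uses of the file: institution information for portal screens, the OAI baseURL and sets for automated harvesting, and the DAIs for matching publication lists.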
Usage statistics – EO use case
• EO use case: present download rates through EO portal per publication, scholar, institution
• Normalization of exchange format and communication protocol
  • OAI-PMH exchange of SWUP OpenURL ContextObjects (Scholarly Works Usage Community Profile)
• Special considerations:
  • Encryption of IP address of requester (MD5 hash)
  • Filtering out robot requests (list of 50 regular expressions)
  • Filtering out double clicks
• Similar initiatives come together at Knowledge Exchange workshop, Berlin 29-30 March 2010
  • JISC (Usage Statistics Review project), Pirus2, SurfSure, Counter, Mesur, OA-Statistik, Economists Online
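
The three special considerations can be sketched as a small filtering step. This is an illustration under stated assumptions, not the project's code: the three robot patterns stand in for the list of 50 regular expressions, the 10-second double-click window is an invented value (the slides do not give one), and all function names are hypothetical.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumption: three stand-in patterns; the project list holds about 50.
ROBOT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"bot\b", r"crawler", r"spider")]
# Assumption: the slides do not state the double-click window.
DOUBLE_CLICK_WINDOW = timedelta(seconds=10)


def anonymize_ip(ip):
    """Replace the requester's IP address by its MD5 hex digest."""
    return hashlib.md5(ip.encode("utf-8")).hexdigest()


def is_robot(user_agent):
    """True if the user agent matches any robot pattern."""
    return any(p.search(user_agent) for p in ROBOT_PATTERNS)


def filter_downloads(events):
    """Keep genuine downloads from (timestamp, ip, user_agent, document_id) tuples.

    Events must be sorted by timestamp. Robot requests are dropped, and a
    repeated request for the same document from the same (hashed) address
    within DOUBLE_CLICK_WINDOW of the previous one counts as a double click.
    """
    last_seen = {}
    kept = []
    for ts, ip, agent, doc in events:
        if is_robot(agent):
            continue
        key = (anonymize_ip(ip), doc)
        previous = last_seen.get(key)
        last_seen[key] = ts
        if previous is not None and ts - previous <= DOUBLE_CLICK_WINDOW:
            continue
        kept.append((ts, key[0], doc))
    return kept
```

Note that MD5 is a hash rather than encryption in the strict sense, and an unsalted hash of an IPv4 address can be reversed by enumeration, so this step obscures the address more than it protects it.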
Usage statistics – implementation status
• Central EO Gateway – DoDoCo (Document Download Counter)
  • OAI-PMH harvesting of SWUP ContextObjects into SQL database
  • Enrich with information on item, scholar, institution
  • Web service: level (item, scholar, institution) + date range
  • Technical guidelines for project partners (available in open access)
• Partners
  • Implementation
    - for all major IR platforms
    - solution for Combined Log Format web logs
  • Registration through Admin file
  • 7 original + 1 new partner
• Not enough data available
• Not visible through EO portal yet, although DoDoCo software is ready
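
A web service that reports downloads per level and date range could sit on a query like the one below. This is only a sketch: the `downloads` table, its columns and the sample rows are invented, and SQLite stands in for whichever SQL database DoDoCo actually uses.

```python
import sqlite3

LEVELS = {"item", "scholar", "institution"}


def open_demo_db():
    """Create an in-memory database with a hypothetical downloads table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE downloads (day TEXT, item TEXT, scholar TEXT, institution TEXT)"
    )
    conn.executemany(
        "INSERT INTO downloads VALUES (?, ?, ?, ?)",
        [
            ("2010-03-01", "item-1", "dai-0001", "inst-A"),
            ("2010-03-02", "item-1", "dai-0001", "inst-A"),
            ("2010-03-02", "item-2", "dai-0002", "inst-A"),
            ("2010-04-01", "item-2", "dai-0002", "inst-A"),
        ],
    )
    return conn


def download_counts(conn, level, start, end):
    """Count downloads per item, scholar or institution between two ISO dates."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    # level was checked against a fixed whitelist, so it is safe to interpolate;
    # the dates are passed as bound parameters.
    sql = (
        f"SELECT {level}, COUNT(*) FROM downloads "
        "WHERE day BETWEEN ? AND ? "
        f"GROUP BY {level} ORDER BY {level}"
    )
    return dict(conn.execute(sql, (start, end)).fetchall())
```

Storing dates as ISO strings keeps the `BETWEEN` comparison correct without date functions, which is a convenience of this sketch rather than a statement about the real schema.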
Added value services
• Publication lists
  • Per DAI of authors who are registered in Admin file
  • Publications are extracted from EO gateway over SRU and formatted
    • APA+ in HTML
      • with links to full text in EO partner repository
      • with links to publisher sites (through OpenURL resolution)
    • APA in PDF
    • APA in RTF
    • RIS
    • BibTex
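
Extracting one author's publications over SRU comes down to a searchRetrieve request. The sketch below builds such a request URL; the parameter names are standard SRU 1.1, while the base URL and the CQL index name `dai` are assumptions, since the slides do not give the gateway's actual index names.

```python
from urllib.parse import urlencode

SRU_BASE_URL = "http://gateway.example.org/sru"  # hypothetical endpoint


def publication_list_url(dai, max_records=50):
    """Build an SRU searchRetrieve URL for all publications of one author."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": f'dai = "{dai}"',  # "dai" is a hypothetical CQL index name
        "maximumRecords": str(max_records),
    }
    return SRU_BASE_URL + "?" + urlencode(params)
```

The records returned by such a request would then be passed to the formatter for the chosen output (APA in HTML, PDF or RTF, RIS, BibTex).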
Added value services
• Enriched descriptive metadata
  • JEL classification
    • Enrichment service (ES) gets records to be enriched from EO, over SRU
    • ES creates enrichment record(s), using text mining technology
    • ES makes enrichment record(s) available to EO, over OAI-PMH
    • EO harvests enrichment records from ES and integrates into original record
    • EO reuses enrichment information in its services: index & present
  • Bibliographic references
    • Through collaboration with RePEc/CitEc
    • Visible through EO portal
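
The step in which EO integrates harvested enrichment records into the original record can be pictured as a merge keyed on the record identifier. This is a minimal sketch with invented data shapes (plain dictionaries with `target` and `jel` keys); in the real system the enrichment is stored as an additional item inside the DIDL compound object.

```python
def integrate_enrichments(records, enrichments):
    """Merge JEL codes from enrichment records into copies of the originals.

    records:     {identifier: metadata dict}
    enrichments: list of {"target": identifier, "jel": [codes]}
    The input records are left untouched.
    """
    merged = {rid: dict(meta) for rid, meta in records.items()}
    for enrichment in enrichments:
        target = merged.get(enrichment["target"])
        if target is None:
            continue  # enrichment for a record the gateway does not hold
        codes = list(target.get("jel", []))
        for code in enrichment["jel"]:
            if code not in codes:
                codes.append(code)
        target["jel"] = codes
    return merged
```

Keeping the enrichment separate from the partner-supplied metadata until this merge means a re-harvest of either side does not overwrite the other.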
Added value services
• Full-text search service
  • Process
    • Full-text indexer component in Meresco fetches relevant records from EO Gateway over SRU
    • Follow links to PDF object files
    • Text is extracted from PDF, and added to record through SRU Update
    • EO can now index & present
  • Prototype exists
  • Not yet fully deployed in EO portal
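
The full-text process can be outlined as a loop over records. The sketch below is not the Meresco component: it takes the three external steps (fetching a PDF, extracting its text, sending the SRU Update) as caller-supplied functions with hypothetical names, so that only the control flow is shown.

```python
def index_full_text(records, fetch_pdf, extract_text, update_record):
    """Add extracted full text to every record that links to a PDF.

    records:       iterable of {"identifier": str, "object_files": [urls]}
    fetch_pdf:     callable(url) -> bytes
    extract_text:  callable(bytes) -> str
    update_record: callable(identifier, text), e.g. a wrapper around SRU Update
    Returns the number of records that were updated.
    """
    updated = 0
    for record in records:
        texts = []
        for url in record.get("object_files", []):
            if not url.lower().endswith(".pdf"):
                continue  # only PDF object files are processed
            texts.append(extract_text(fetch_pdf(url)))
        if texts:
            update_record(record["identifier"], "\n".join(texts))
            updated += 1
    return updated
```

Selecting PDFs by file extension is a simplification of this sketch; a real indexer would more likely rely on the MIME type declared in the metadata or returned by the server.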
Added value services
• Multilinguality (EN, FR, DE, ES)
  • Complete EO portal interface
  • JEL classification
  • MLIA functionality in EO portal
    • Student thesis – Prof. Bouillon (Univ. of Geneva, multilingual information processing department)
      • (uncustomized) Systran and Google Translate show equivalent results
    • Contacts with CACAO (also through Europeana)
      • comes as a complete portal solution, not as an add-in for existing portals like EO
    • Considerations:
      • Lingua franca in economics = EN
      • NEEO = NOT research project in linguistics, aim: reuse best existing technology
    -> Use “Google Translate” for translation of queries
Collaboration with RePEc
• Harvesting metadata from RePEc into EO
  • AMF to DIDL/MODS mapping
• Push metadata from EO to RePEc
  • “RePEc:ner” archive, with separate series for each EO institution
  • According to agreed-upon reviewed ReDIF format
• Admin file directives in order to limit overlap
• Contribute to LogEc
• Reuse CitEc data in EO portal
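
Pushing a record to RePEc means rendering it as a ReDIF template. The sketch below writes a ReDIF-Paper template with a handful of common fields; the record layout and the series code are hypothetical, and the profile agreed between NEEO and RePEc may use more or different fields than shown here.

```python
def to_redif_paper(record, series):
    """Render one record as a ReDIF-Paper template in the RePEc:ner archive.

    record: {"authors": [names], "title": str, "number": str, "url": optional str}
    series: series code of the EO institution (hypothetical value in the example)
    """
    lines = ["Template-Type: ReDIF-Paper 1.0"]
    for name in record["authors"]:
        lines.append(f"Author-Name: {name}")
    lines.append(f"Title: {record['title']}")
    if record.get("url"):
        lines.append(f"File-URL: {record['url']}")
        lines.append("File-Format: application/pdf")
    lines.append(f"Handle: RePEc:ner:{series}:{record['number']}")
    return "\n".join(lines) + "\n"
```

The handle combines the archive code, the per-institution series and the item number, which is how the separate series for each EO institution would show up in RePEc.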
EO gateway and portal
• Gateway – metadata store and search engine
  • Choice between Summa, SOLR/Lucene, Meresco
  • Open source solution, based on Lucene search engine
  • Support available from software developers (CQ2 company)
  • Has proven its qualities in the past (DARENet)
• Portal
  • First version: home-made
  • Final version:
    • outsourced design to private company
    • HTML, CSS, JavaScript, all images