Project Acronym: CRISPool Version: 2.2 Contact: [email protected] Date: 01/12/2010

Page 1 of 55

JISC Final Report

Project Information

Project Acronym: CRISPool
Project Title: Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal
Start Date: 1 March 2010
End Date: 31 August 2010
Lead Institution: University of St Andrews
Project Director: Anna Clements
Project Manager & contact details: Anna Clements, [email protected], 01334 462761
Partner Institutions:
SUPA (The Scottish Universities Physics Alliance) http://www.supa.ac.uk/
University of Edinburgh
University of Glasgow
EuroCRIS http://www.eurocris.org
Atira A/S http://www.atira.dk
Project Web URL: http://www.crispool.org
Pilot Portal: http://crispool.atira.dk/portal
Programme Name (and number): Information Environment Programme 2009-2011, Research Information Management Call 11/09
Programme Manager: Neil Jacobs / Frederique Van Till

Document Name

Document Title: Final Report
Reporting Period:
Author(s) & project role: Anna Clements (Project Manager); Niall Lockhart (Project Management Support Officer)
Date:
Filename:
URL if document is posted on project web site:
Access: √ Project and JISC internal √ General dissemination

Document History

Version  Date        Comments
V1.0     31/08/2010  Circulated to partners and programme manager
V2.0     09/09/2010  Amendments from partners plus Appendices
V2.1     16/09/2010  Final version to JISC


CRISPool Project

Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal

Author(s): Anna Clements (Project Manager) Niall Lockhart (Project Management Support Officer)

Contact Anna Clements [email protected] University of St Andrews Business Improvements Butts Wynd Building St Andrews Fife KY16 9AD


Table of Contents

Acknowledgements
Executive Summary
Background
Aims and Objectives
Methodology
Implementation
Sourcing the data
Producing the CERIF-XML
Outputs and Results
Outcomes
Conclusions
Implications
References
Appendix 1: CRISPool Data Dictionary
Appendix 2: Class Scheme Data
Appendix 3: CRISPool CERIF to PURE4 mapping
Appendix 4: Technical Summary - CRISPool project prototype implementation


Acknowledgements

The CRISPool project would like to acknowledge the contributions of the following organisations to the success of the project:

• JISC for part-funding the project through the Information Environment Programme 2009-11 and its Research Information Management Call 11/09

• The project partners for their invaluable contributions to the project:
o SUPA (The Scottish Universities Physics Alliance) [1]
o University of Glasgow
o University of Edinburgh
o EuroCRIS [2]
o Atira A/S [3]

• The ERIS [4] and R4R [5] (Readiness4Ref) project teams for continuing enthusiastic support and advice.

1 www.supa.ac.uk
2 www.eurocris.org
3 www.atira.dk
4 http://eriscotland.wordpress.com/
5 http://www.kcl.ac.uk/iss/cerch/projects/portfolio/r4r.html


Executive Summary

We have successfully used CERIF-XML to bring together data on people, organisations and publications from three Universities for the SUPA [Scottish Universities Physics Alliance] research pool. These data are viewable and searchable at http://crispool.atira.dk/portal. This was the main aim of the project and has been achieved within the limited timescale and budget of this JISC call.

The collaborative aspect of the project, involving partner institutions, pool administrators, euroCRIS, the third-party developers Atira, and the related JISC-funded projects Readiness4REF (R4R) and ERIS, has meant that a wide range of stakeholders has been involved at all stages to help ensure the success of the project. The approach taken meant that we were learning how to use CERIF-XML as we went along, so the expert help and advice of euroCRIS and Atira, who are members of the euroCRIS CERIF Task Group, and the sharing of preliminary findings from the Readiness4Ref project led by King's College London have been invaluable. Additionally, the enthusiastic support from the ERIS project has provided a channel to other pools in Scotland, several of which have expressed interest in the project.

Once we had agreed which data the partner Institutions (Glasgow and Edinburgh) could reasonably provide within the timescale, the University of St Andrews created sample CERIF-XML files that the other partner institutions could use to generate the data needed for the portal. Each institution took a different approach to generating its XML data, but all used relatively low-tech text editing and search-and-replace tools. No additional specialist knowledge was required.

Although the main aim of the project was to test the suitability of CERIF-XML as an exchange format, it was evident that those Institutions with an existing culture of integrated research information management were better able to provide the required data quickly. For St Andrews no additional work was required, as all data were fed in from the existing CRIS. Glasgow, which has had an in-house integrated research information management system for many years, was able to provide data on people and publications easily. Edinburgh was able to provide data on people but unfortunately was not able to provide publications data within the project timeframe.

Returning to the main aim of testing CERIF-XML's suitability as an exchange format, the CERIF data model fully supported the requirements of the project except for two relatively minor areas, which have been reported to euroCRIS. For the pilot project we have been able to work around these issues by using CERIF classifications, something that R4R has also been able to do when mapping the RAE2008 schema to CERIF.

The main technical issue we have found is the fragmentation of CERIF-XML into many individual XML files. Each item, whether a person, an organisation or a publication, is defined by data in up to 10 related XML files, so the data are very resource intensive to process. The issue facing the designers of CERIF is that the model itself needs to represent the real world of interrelated research information, which is a fully connected graph, whereas XML is a linearised tree structure and cannot natively represent the complexity required. At the same time, XML is the vehicle of choice for data exchange in web services.

In conclusion, all partners are positive about the results of the CRISPool project and SUPA is keen to move forward from a pilot to a sustainable solution. While there are still areas to improve (for example, the processing of multiple XML files), the sector as a whole can take heart from our findings, which reinforce the conclusion of the EXRI report that CERIF should be used as the exchange format within the UK research information sector.


Background

The importance of collecting, maintaining and exchanging good quality, comprehensive and current research information has risen up the agenda in the UK Higher Education sector following the recently completed RAE2008 data collection exercise. In particular, the use of a standard exchange format, the Common European Research Information Format (CERIF), to improve interoperability of data between the different stakeholders (Funding Councils, Research Councils and other funders, HESA, Institutions) has been discussed by a JISC-led Research Information Management Group. This group commissioned the EXRI project to examine the suitability of CERIF versus other possible standards, or no standard. The final report [6] recommends the use of CERIF as a standard exchange format between the stakeholders. CRISPool builds directly on Recommendation 7 from this report: ‘.. pilots to look at real exchange of research activity data between HEIs using CERIF’.

Research pooling is well established in Scotland and there are currently 13 pools [7], the oldest being SUPA, established in 2005. The pools were set up to help create and maintain the critical mass of resources needed for Scotland's universities to carry out world-class research. The success of the initiative was highlighted by the RAE2008 results, in which Scottish institutions increased their share of the UK's world-class research from 11.6% in 2001 to 12.3%, even though the country has only 8.5% of the UK population. Every Scottish institution now has world-leading research in at least one of its disciplines.

This approach is also being discussed in the national UK press, as reported in Times Higher Education [8] (THE, 5th August 2010), which presented the views of David Price, Vice-Provost for Research, and Stephen Caddick, Vice-Provost for Enterprise, both at University College London:

‘ … the coming cuts to the sector will necessitate "major restructuring" to preserve the global standing of the elite universities on which the success of UK higher education depends.

The elite, they propose, should pool and coordinate their research strengths to form hubs of about half a dozen regional "research clusters".’

The current information infrastructure underpinning SUPA, as with the other pools, is poor, and both SUPA administrators and members of the partner institutions spend much resource, often duplicating effort, collecting and checking data on staff, students and publications. This information is held at member institutions in different formats and with different vocabularies, for example for similar job descriptions or publication types. It is collated and presented for reporting to the Scottish Funding Council.

In SUPA, information on staff and students is used to provide access to the My.SUPA [9] portal, a virtual learning environment and research collaboration portal based on Moodle [10]. Logged-in staff and students are given access to lists of other users, along with some limited profile information listing interests in various research themes, with the aim of fostering new collaborations between researchers across Scotland.

Gathering publications information has been particularly resource intensive. In previous years it has been carried out by requesting information from department administrators, who passed on the data in spreadsheets. SUPA administrators verified the information by emailing each staff member or research student, and provided opportunities to make corrections and additions prior to publication of a printed publications list. This final publications list is only available in a limited electronic form: a PDF version of the printed copy. This monolithic document offers no way to analyse the information other than in the order and format provided, apart from a simple text search of the PDF.

Data supplied by the departments is of varying quality, and is sometimes provided indirectly rather than being sourced from institutional information systems. The complete dataset is based on information from different information systems with different business rules and data constraints. Feedback from several department administrators involved in these requests for information suggests that data is gathered from a variety of sources, including institutional systems, departmental information systems and local files. This approach answers the immediate query expeditiously but does not follow a repeatable process. Several requests and clarifications may be required in each information gathering exercise. The data is generally updated only once a year and is not consistently maintained between annual reporting cycles, so it quickly goes out of date and becomes considerably less useful.

The CRISPool project partners have been working in the area of research information for several years, including innovative projects to link research information and management systems to open access repositories. Glasgow University has developed an innovative integrated research management system; the University of Edinburgh is leading a consortial approach to driving the open access agenda forward in Scotland; and the University of St Andrews, in a joint project with the University of Aberdeen, is the first UK institution to implement the CERIF-based CRIS [Current Research Information System] product (PURE), made by the Danish company Atira. A key stipulation by both St Andrews and Aberdeen, supported by Atira, is that the conceptual data model developed for the UK should be made available to other UK Institutions implementing or investigating a CERIF-CRIS, independent of which system they choose.

6 http://www.jisc.ac.uk/publications/briefingpapers/2010/bpexriv1.aspx#downloads
7 http://www.sfc.ac.uk/research/researchpools/researchpools.aspx
8 http://www.timeshighereducation.co.uk/story.asp?storycode=412909
9 http://my.supa.ac.uk/
10 http://www.moodle.org

Aims and Objectives

CRISPool builds on the experience gained by the partners and their desire to work with other Institutions to find practical ways of reducing the overall burden of research information management across the sector. The implementation of PURE has demonstrated the suitability of CERIF for capturing research information internally within the two Institutions (St Andrews and Aberdeen). The CRISPool project had the following aims:

• To demonstrate that CERIF-XML can be used to bring data from heterogeneous, cross institutional sources together.

• To provide evidence of the benefits and costs of adopting CERIF-XML as a cross-institutional data exchange format.

The aims were to be met through the main objective:

• To build an initial portal exposing these data on the web with basic search & retrieve functionality and basic technical exhibition of data (e.g. fetching data via RSS, XML/SOAP, OAI).

Whilst testing the suitability of CERIF-XML was the primary focus of this project, there was also an expectation that organisational and information systems changes would occur as a direct result of the need to ensure data is up to date, sufficiently accurate and meets the commonly agreed criteria. The results from CRISPool are both transferable (to other exchange scenarios) and scalable (to other CERIF elements not included in this project). These aims and the objective remained constant throughout the project, and the Outputs and Results section below discusses the degree to which they have been met.

Methodology

With a six-month project and multiple partners, the methodology used was to build on existing expertise: euroCRIS with their in-depth knowledge of CERIF; Atira with their existing PURE CRIS product and established expertise in CERIF-based CRIS; and the partner Institutions [St Andrews, Glasgow and Edinburgh] with their experience in the area of research information and repository systems. The project was split into three strands:

1. Scoping and Investigation: defining the data model (entities, relationships, constraints) and common vocabularies (people, publications, organisations) to meet the SUPA requirement for an annual publications report. This strand also identified data sources and determined any limitations necessary due to data availability.

Due to the limited time span and resource available to the project at the partner institutions, we kept the data requirements to the minimum that could be provided by Glasgow and Edinburgh and was still useful to SUPA. Thus Glasgow started with the dataset produced for the REF Bibliometrics Pilot project in 2008-9. This data set already linked outputs to staff using the institutional ID. Edinburgh aimed to provide all current academic staff in the School of Physics and Astronomy and then match them against publications data from the Edinburgh Research Archive [ERA]. A comprehensive set of publications data related to current academics in the School of Physics and Astronomy at St Andrews was provided from the PURE CERIF-CRIS database in CERIF-XML format.

Figure 1: Summary of data flow in CRISPool

2. Technical delivery: configure and install PURE for the defined data model and build the CERIF-XML integrator; map data sources to CERIF-XML to produce single or multiple data streams for integration into PURE; build a simple portal to expose data via web pages, web services and RSS feeds.

Again, because of the short timeframe and limited budget, existing technology [the PURE4 product] formed the backend database and administrator functionality. Atira built CERIF-XML export [to export St Andrews data to CERIF-XML] and import functions [to import St Andrews, Glasgow and Edinburgh CERIF-XML] as add-ons to the PURE4 product. Each function was triggered through the administrative interface on an ad-hoc basis or using a standard cron job configuration. Finally, a simple portal was created based on the current SUPA website design.

3. Engagement and Evaluation: conduct a baseline review during the SUPA annual data collection round; record the time and effort needed to identify sources and map them to CERIF; assess advantages and disadvantages; maintain ongoing engagement with regional, national and European projects and groups, e.g. ERIS led by Edinburgh [project manager is a member of CRISPool], Enquire led by Glasgow [project manager is also a member of CRISPool], Readiness4Ref (led by KCL), UCISA, WRN/ARMA and euroCRIS.

The engagement strand ran throughout the project and is continuing, for example at the Repository Fringe, Sep 2010, at Edinburgh. Due to resource issues at SUPA a full baseline review was not carried out; however, feedback from SUPA staff has been incorporated into the Outputs and Results section.

Implementation

Two workshops were held early in the project [March and April 2010] to familiarise all partners with the CERIF model; to finalise the data requirements, taking into account what was achievable over the short timescale and also useful to SUPA; and to share the experiences of the R4R project in mapping RAE2008 to CERIF-XML. We also agreed to create institution-specific unique IDs for the organisations, persons and publications being brought together into CRISPool by using the UK Learner Provider number as a prefix to institutional IDs. The scope of CRISPool did not allow any time to deduplicate or merge data on publications and, potentially, people. A CRISPool project was created within Basecamp [11], the existing ERIS project online collaboration tool, to help plan and manage the project.

11 http://basecamphq.com/
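The ID scheme agreed at the workshops can be illustrated with a minimal sketch. This is not code from the project: the function name, the "-" separator and the provider numbers shown are hypothetical, and the 128-character limit is the one this report gives for CERIF-XML 2008-1.1 IDs.

```python
MAX_ID_LENGTH = 128  # ID length limit the report gives for CERIF-XML 2008-1.1


def make_pool_id(provider_number: str, local_id: str, sep: str = "-") -> str:
    """Prefix an institutional ID with the institution's provider number.

    The separator is an assumption; the report does not say how the prefix
    and the local ID were joined.
    """
    provider_number = provider_number.strip()
    local_id = local_id.strip()
    if not provider_number or not local_id:
        raise ValueError("both a provider number and a local ID are required")
    pool_id = f"{provider_number}{sep}{local_id}"
    if len(pool_id) > MAX_ID_LENGTH:
        raise ValueError(f"ID exceeds {MAX_ID_LENGTH} characters: {pool_id!r}")
    return pool_id


# Hypothetical provider numbers and staff IDs, for illustration only.
id_a = make_pool_id("10000001", "staff42")
id_b = make_pool_id("10000002", "staff42")
print(id_a, id_b)
```

Two institutions that happen to use the same local staff ID still receive distinct pool-wide IDs, which is the property the prefix is meant to guarantee. It does not detect the same person or publication appearing at two institutions; as noted above, deduplication was out of scope.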


Figure 2: Elements of CERIF used in CRISPool

Figure 2 shows the basic CERIF model with the elements used in CRISPool highlighted. In all, 30 CERIF-XML files were used in the pilot:

cfPers-CORE
cfPers_Class-LINK
cfPers_EAddr-LINK
cfPers_OrgUnit-LINK
cfPers_PAddr-LINK
cfPers_ResPubl-LINK
cfPersKeyW-LANG
cfPersName-ADD
cfPersResInt-LANG
cfOrgUnit-CORE
cfOrgUnit_EAddr-LINK
cfOrgUnit_Class-LINK
cfOrgUnit_OrgUnit-LINK
cfOrgUnit_PAddr-LINK
cfOrgUnit_ResPubl-LINK
cfOrgUnitName-LANG
cfResPubl-RES
cfResPubl_Class-LINK
cfResPubl_ResPubl-LINK
cfResPublAbstr-LANG
cfResPublBiblNote-LANG
cfResPublKeyW-LANG
cfResPublAbbrev-LANG
cfResPublSubtitle-LANG
cfResPublTitle-LANG
cfEAddr-2ND
cfPAddr-2ND
cfEAddr_Class-LINK
cfClassTerm-LANG
cfClass-CLASS

Full details are in Appendix 1: CRISPool Data Dictionary, Appendix 2: CRISPool Class Scheme Data and Appendix 3: CRISPool CERIF to PURE4 mapping.
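Because each entity is spread over several of these files, a consumer has to join records by ID before it has a usable person, organisation or publication. The sketch below illustrates that join for persons only, using plain Python dictionaries in place of parsed XML. The rows and field names are illustrative assumptions and are not taken from the project data or copied from the CERIF schema.

```python
# Hypothetical rows, one list per CERIF-XML file; field names are illustrative.
pers_core = [{"cfPersId": "A-1"}, {"cfPersId": "A-2"}]
pers_name = [
    {"cfPersId": "A-1", "cfFamilyNames": "Smith", "cfFirstNames": "Jo"},
    {"cfPersId": "A-2", "cfFamilyNames": "Jones", "cfFirstNames": "Sam"},
]
pers_orgunit = [
    {"cfPersId": "A-1", "cfOrgUnitId": "A-PHYS"},
    {"cfPersId": "A-1", "cfOrgUnitId": "SUPA-THEME-1"},
    {"cfPersId": "A-2", "cfOrgUnitId": "A-PHYS"},
]


def assemble_persons(core, names, org_links):
    """Join the core, name and person-to-organisation rows on cfPersId."""
    persons = {
        row["cfPersId"]: {"id": row["cfPersId"], "name": None, "org_units": []}
        for row in core
    }
    for row in names:
        person = persons.get(row["cfPersId"])
        if person is not None:
            person["name"] = f'{row["cfFirstNames"]} {row["cfFamilyNames"]}'
    for row in org_links:
        person = persons.get(row["cfPersId"])
        if person is not None:  # link rows with unknown IDs are skipped here
            person["org_units"].append(row["cfOrgUnitId"])
    return persons


persons = assemble_persons(pers_core, pers_name, pers_orgunit)
print(persons["A-1"])
```

Every additional file type (addresses, classifications, keywords and so on) adds another pass of this kind, which gives a sense of why processing the full set of files is resource intensive.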

Sourcing the data

For Glasgow this was straightforward once the data requirements had been finalised. The information on persons came from the Institutional HR database and that for publications from the data set produced for the REF bibliometrics pilot. Glasgow considered using data from their institutional repository, but at the time this did not link publications to internal authors via the institutional ID. The data from the HR database was already integrated with the research management system at Glasgow, so there were no problems with reusing this data for CRISPool.

For Edinburgh, detailed person data was provided from the Institutional HR database. While the data provided was of good quality, it is worth noting that it took the team at Edinburgh some time to find the right contact within HR who could authorise use of the data for CRISPool. At Glasgow and St Andrews these links had already been made, so no delay was incurred. The publications data was sourced from the central closed Publications Repository and checked to ensure the bibliographic data could be made publicly available. It had originally been planned to use the public Edinburgh Research Archive, but on investigation this included only two publications by academics in the current HR feed, both journal articles. Edinburgh therefore switched to using data from the closed Publications Repository, which had many more articles and was the repository used for the RAE submission.

For St Andrews all the required data was sourced directly from the Institution's PURE4 CRIS, which is itself synchronised daily with data from the Institutional HR database. The CRIS is the golden source of publications data.

Producing the CERIF-XML

Following on from the workshops, the University of St Andrews created template files with some sample data for each of the CERIF-XML files to be used in the pilot. We used documentation from the eurocris.org web site and advice and examples from the R4R mapping documents. These sample files were validated against the CERIF-XML 2008-1.1 schema at http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/

Note: we started with the CERIF-2008_1.0 version but switched to the later version in order to be able to use longer IDs. Version 1.0 handled IDs up to 32 characters long; version 1.1 handles IDs up to 128 characters.

The CERIF-XML sample files were created using the text editor Notepad. They were distributed to the Universities of Edinburgh and Glasgow via the Basecamp site for them to populate with their own data.
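As an illustration of how a template of this kind can be populated from tabular data, the sketch below merges rows into a record template and confirms that the result is well-formed XML. It is a scripted analogue of a template-and-merge approach, not the process any partner actually used. The element names, the root element and the column names are assumptions for illustration; the real structure should be taken from the CERIF-XML 2008-1.1 schema, and a well-formedness check is not a substitute for validating against that schema.

```python
import csv
import io
import xml.etree.ElementTree as ET
from string import Template
from xml.sax.saxutils import escape

# Illustrative record template; element names are assumptions, not schema copies.
RECORD = Template(
    "  <cfPersName>\n"
    "    <cfPersId>$pers_id</cfPersId>\n"
    "    <cfFamilyNames>$family</cfFamilyNames>\n"
    "    <cfFirstNames>$first</cfFirstNames>\n"
    "  </cfPersName>\n"
)
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<CERIF>\n'  # assumed root element
FOOTER = "</CERIF>\n"


def merge_rows(csv_text: str) -> str:
    """Fill the record template once per CSV row and wrap the result."""
    rows = csv.DictReader(io.StringIO(csv_text))
    body = "".join(
        # escape() protects characters such as & and < in the source data
        RECORD.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )
    return HEADER + body + FOOTER


# Hypothetical sample data; note the ampersand in the family name column.
sample = "pers_id,family,first\n10000001-1,Smith & Co,Jo\n"
xml_text = merge_rows(sample)

# Parsing raises ParseError if tags are mismatched or characters are unescaped.
root = ET.fromstring(xml_text.encode("utf-8"))
print(len(root), "record(s) generated")
```

Writing the output with an explicit UTF-8 encoding, for example `open(path, "w", encoding="utf-8")`, avoids files being saved in a platform default such as Latin-1.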


The University of Glasgow used MS Excel to create a worksheet for each CERIF-XML file required, populating the worksheets with data on persons from their Human Resources database and publications from the REF Bibliometrics pilot; the latter included links to the HR database via the Institutional staff ID. MS Word Mailmerge was then used to merge the data from each worksheet into the correct CERIF-XML template. The XML header and footer were added to each resulting .doc file, which was saved as .txt and then renamed as .xml. This process took approximately 1 day to complete.

For the University of Edinburgh, generating the CERIF-XML files for persons was a largely manual process. HR provided an Excel spreadsheet containing academic names, HESA numbers and job titles. These were then amalgamated with further information manually copied from the school staff webpages. This process took 2.5 days to complete. The publications data was provided as an export from the DSpace Publications Repository but has not yet been converted to CERIF-XML for importing. Related work taking place as part of R4R to create a CERIF plug-in for DSpace is expected to provide this functionality.

SUPA were provided with lists of people from all three Institutions in order to match them to the existing SUPA ID and SUPA theme/s. This data was provided back to St Andrews in spreadsheet format, and CERIF-XML cfPers_OrgUnit-LINK files were produced linking each person to the main and additional SUPA themes. We had originally planned to use a classification for the SUPA themes but switched to using organisations very quickly once we realised that this would allow us more flexibility in linking other entities, such as persons and publications, to themes. In practice SUPA treat the themes as virtual organisations.

Both sets of files (from Glasgow and Edinburgh) required tidying up before they validated successfully against the CERIF-XML Schemas. The issues included mismatched CERIF tags and elements in the wrong order. There was also a problem initially with files being saved with LATIN-1 encoding rather than UTF-8. All these issues were solved using the freeware text and Unicode editor PSPad [12]. On the whole it was a successful and straightforward low-tech process, although there were a couple of more time-consuming problems where inconsistencies in the IDs across files prevented data being linked correctly once imported. Again these could be solved using PSPad, which had good functionality for checking several files side by side.

What is evident is that the time taken initially to define requirements and prepare sample files was very important. In this project the resource was very limited, and undoubtedly there would have been far fewer errors had more resource been available at the member institutions. Equally, once a more automated process is established, such errors should be removed completely.

For St Andrews, data was exported using the export framework in Pure. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.

Suitability of CERIF

Most of the data mapped across to the CERIF model easily, but there were two areas where the CERIF data model imposed restrictions. Both of these have been raised with the euroCRIS CERIF Task Group and could be worked around for this pilot using CERIF classifications.

• Issue 1: The placing of a person's contact details as an attribute of the person rather than an attribute of the relationship between the person and organisation (cfPers_EAddr-LINK, cfPers_PAddr-LINK). The CRISPool model concentrates entirely on work contact details rather than personal contact details, so it is normal that a person's contact details will change as they move from job to job.

o For the pilot a workaround was used whereby the classifications of the cfPers_EAddr and cfPers_PAddr relations were used to carry data. See Appendix 3: CRISPool CERIF to PURE4 mapping for details.

• Issue 2: A one-to-one relationship between the publication entity and URI (cfResPubl-CORE.URI). Thus we were unable to record both a DOI [13] and a URI to a full-text version in the IR against publications. This issue has been discussed previously and at length within euroCRIS and is a philosophical issue rather than a technical one. The current euroCRIS view is that each Publication object is represented by 1 and only 1 URI; if another URI is needed then that is another Publication object. This debate leads into the definitions of ‘work’, ‘manifestation’ and so on from the FRBR [14] model, and is not part of the CRISPool project.

o For the pilot we restricted ourselves to the DOI, as there was more data for this than for URIs to full text in IRs.

12 http://www.pspad.com/en/
13 http://www.doi.org
14 http://www.loc.gov/cds/FRBR.html

Importing the CERIF-XML

The CERIF-XML was imported into the CRISPool PURE instance by uploading all the XML files into WebDAV15 folders. There were four WebDAV folders, one for each institution involved (St Andrews, Glasgow, Edinburgh and SUPA). However, there were some issues with accessing these folders due to the operating systems used by the CRISPool team. A solution was found in using a free program called NetDrive16 to gain access to the folders.

Once the XML data files were uploaded into the appropriate WebDAV folders, the data could be imported into PURE by selecting to synchronise it. If there were any problems with the CERIF-XML in any of the files being imported into PURE, all details of the errors could be accessed after a failed synchronisation; the error list gave the file name and line number of each issue so that it could be amended. This detailed error logging helped to identify inconsistencies between IDs in the separate files, for instance. The synchronisation jobs could be run repeatedly to update existing data, and this functionality was used, for example, when we received additional data on external authors from Glasgow. Unfortunately, due to the number of external authors on these publications (there was an average of 150 authors per publication), the import process was taking so long that we decided to limit the authors to the first 5 (including at least one Glasgow author).

Atira adopted an agile approach to the development of the import functionality, working closely with the CRISPool team at St Andrews to test first the organisation and person import and then the publications import. This incremental approach meant we could sort out any issues with the organisations and persons before moving on to the much larger data sets containing publications data. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.

Outputs and Results

The CRISPool project has several deliverables, the first being the actual CRISPool portal: http://crispool.atira.dk/portal/

The portal has been designed to look and feel the same as the SUPA website. On the front page of the portal a selection of the most recent publications is displayed. A search bar allows anyone to search through researchers, organisations and publications. There is also a navigation menu on the right side of the portal pages which allows a user to browse the available data alphabetically. The portal also offers a ‘statistics’ option which allows a user to view charts showing the volume and format of research of the institutions involved over the last 5 years. RSS feeds are also available from the portal.

The CRISPool project managed successfully to employ the CERIF data model (2008 version 1.1). This can be demonstrated through the CERIF-XML files that have been created during the project, as each file conforms to the data model documentation and the schema, which can be found on the euroCRIS website: http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/

The data model used in CRISPool is described in detail in Appendices 1, 2 and 3.

A less tangible output is the transferable skills developed by the partner institutions in mapping internal data sources to CERIF-XML. The mapping process led to an improved understanding of how the CERIF data model works in a practical situation. For example, at Glasgow, this knowledge helped inform the JISC-funded Enquire17 project looking at Research Council Outcomes and Outputs.

15 http://www.webdav.org/
16 http://www.sitepoint.com/blogs/2004/10/03/novell-netdrive-webdav-client-for-windows/
17 http://researchoutcomes.wordpress.com/

The overall time taken for sourcing, mapping and producing the initial CERIF-XML files, once the sample template files had been created, was between 2 and 5 days. A further period of time was spent by the project team at St Andrews checking and amending the files which, as they had been produced semi-manually and to a tight deadline, were prone to mistyping [or miscopying/pasting!]. This was a one-off, relatively manual process for both Glasgow and Edinburgh, and would need further work to develop into a sustainable production of CERIF-XML for use in keeping the portal up to date and removing the errors in the files. For St Andrews the CERIF-XML files were created directly from PURE using the functionality that Atira developed.

As described in the Methodology section, a systematic baseline review could not be carried out due to resource issues at SUPA. However, SUPA provided the following feedback on the work done by SUPA to gather data from all 6 Institutions [SUPA has only recently been expanded to 8 Institutions]:

‘I'd split the data gathering into two types: data gathering about people in SUPA, and the publication list (which is informed by the first process, however). For the first type, which is gathering the essential contact data for each member of SUPA: This takes approximately 1 month, which includes one week of solid work plus additional time to follow up with the institutions and verify accuracy. For the publication data exercise: This takes approximately 3 months, which includes a mix of solid work periods and following up with institutions. This process includes the initial meetings covering scope, the request for information, the follow up with individual institutions and the accuracy verification, collation and report publishing.’

For St Andrews, prior to implementation of Pure, a School Administrator took 4 days [spread over 2 weeks] to run Web of Science searches for all academic staff and research fellows. These data were not checked by individuals due to lack of time. With Pure in place each individual member of staff can maintain an up-to-date, accurate publication list that can then be fed out to CRISPool regularly, not just once a year as now. These data are also reused in other online pages such as School web sites. This not only saves time for School Administrators and individual researchers, as the data is collected once, but also improves data quality and timeliness, with the researcher taking responsibility for their own data.

If production of XML data streams can be automated at member institutions and synchronised within the portal, then this would cut out the annual process of data collection via emailing spreadsheets back and forth between SUPA and each of the institutions. It would also have the benefit that corrections and additions prompted by the publication of information through the pool portal could be carried out directly in the source institutional information systems, saving staff time making separate updates to the SUPA systems and institutional systems.

Outcomes

The project plan listed a set of evaluation factors and questions to address. These are repeated below with an update following the project’s conclusion.

Factor to Evaluate: Suitability of CERIF 2008
Questions to Address: Does it contain all data elements required? Can it be easily extended if not?
Method(s): Evaluate against SUPA requirements
Measure of Success: All elements exist or can be easily added
Outcome: All but 2 directly mapped. Issue 1 – contact details against person not person-organisation relation; worked around using classification. Issue 2 – single URI per publication; philosophical point to be discussed with euroCRIS; opted for DOI not IR handle as more data; could have been addressed with classification

Factor to Evaluate: Ease of mapping to CERIF-XML
Questions to Address: What level of technical expertise required?
Method(s): Feedback from technical experts who did the mapping
Measure of Success: Technical expertise already available at member institutions or easily acquired
Outcome: Yes - Standard text editor tools used by St Andrews, Glasgow and Edinburgh. Basic relational db understanding necessary and knowledge of staff and publications data held by University

Factor to Evaluate: Usefulness of CERIF-CRIS
Questions to Address: Is the CERIF-CRIS an improvement on previous solution? If so, in what way? If not, in what way?
Method(s): Evaluation/Feedback
Measure of Success: CRISPool solution extended to other member institutions
Outcome: SUPA – yes, dynamic searchable portal much better than fixed pdf publications list. At least one other Pool expressed interest. Member Institutions – in principle yes, but requires up to date central data sources and further work to automate CERIF-XML production. So not a CERIF issue in itself. Main technical issue is fragmentation of CERIF-XML, which means import processes are resource intensive.

Factor to Evaluate: Overall time/cost savings
Questions to Address: Across all stakeholders, does use of CERIF-XML increase/decrease resource/cost
Method(s): Evaluation [before/after]
Measure of Success: Decreases resource/cost or further investigation needed over longer time period
Outcome: Further investigation needed. However it is indicative that SUPA were unable to collect data this year using existing method because too resource intensive

Factor to Evaluate: Data quality
Questions to Address: Does using a CERIF-CRIS facilitate improvement in data quality? Does using a publicly available portal facilitate improvement in data quality?
Method(s): Evaluation [before/after]
Measure of Success: Data quality improved or likely to be improved
Outcome: Neutral – as data quality from partner institutions already good for the limited subset of data we were working with. Not answered as portal not public yet

Factor to Evaluate: Project success and impact
Questions to Address: To what extent has the project delivered on objectives and how useful are the project's findings?
Method(s): End Project Report/Lessons Learned
Measure of Success: Funder, partners and stakeholder feedback positive. CRISPool solution extended to other member institutions and CERIF entities
Outcome: Partner and stakeholder feedback positive and keen to move pilot to sustainable system; more pools are interested

Conclusions

On the technical side the project has been straightforward. CERIF is flexible and comprehensive and for the most part does not require additional expertise over and above standard relational database modelling; the exception is the use of Classification Schemes, particularly when used with Link entities. Here the project benefited from the model expertise of euroCRIS and the practical experience of Atira. For those who do not have access to this expertise and experience it would be very useful to have more sample CERIF-XML files available at the euroCRIS.org web-site.

The project has come across a couple of areas where the CERIF data model has not met our needs, or not immediately, and discussion on these is being taken forward in the CERIF Task Group. In one case a workaround was created relatively simply by extending the use of the CERIF classification concept. In the other case a similar workaround could have been employed if we had had time to do so. The discussion with euroCRIS is therefore about whether such workarounds are the correct way to extend CERIF or whether the core CERIF data model should be extended.

The resource-intensive nature of processing the CERIF-XML, which is due to the fragmentation of CERIF-XML into many separate XML files, is something that does need to be addressed, whether by improving algorithms to process the data or by adjusting the CERIF model; however it is difficult to see how the latter can be done without losing the flexibility of the model. This proved to be such a problem with importing co-authors on some of the Glasgow papers, where typically 150-200 authors existed on each paper, that we had to limit the data to the 5 named authors (including at least one Glasgow author).

Finally, the project has shown that in order to best take advantage of an initiative such as CRISPool, Institutions need at least publications and staff data joined up. For Glasgow this was straightforward as they were able to provide the publications data set from the REF bibliometrics pilot, which was linked to internal authors via the HR staff ID. Going forward they are now linking their full publications data set in their Institutional Repository to internal authors and so will be able to provide a more comprehensive set of publications data in the future. For Edinburgh the publications data repository does hold the staff id for the user, which could have been matched with the HR database to allow the publications to be matched easily to persons. St Andrews were able to provide comprehensive data on people and their publications directly from their CRIS.
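The author cap mentioned in these conclusions (five authors per paper, at least one of them from Glasgow) can be illustrated with a short sketch. The report does not say how the authors were actually selected, so the rule below (keep the first five, and swap the last one for the first local author if none is present) is only one plausible reading; the function name and the `is_local` test are hypothetical.

```python
from typing import Callable, List


def cap_authors(authors: List[str], is_local: Callable[[str], bool], limit: int = 5) -> List[str]:
    """Keep at most `limit` authors (limit >= 1), ensuring a local author is included if one exists."""
    kept = list(authors[:limit])
    if any(is_local(a) for a in kept):
        return kept
    # No local author among the first `limit`: swap the last kept author
    # for the first local author found further down the list.
    for author in authors[limit:]:
        if is_local(author):
            return kept[:limit - 1] + [author]
    return kept


# Hypothetical example: 150 external authors with one internal Glasgow author at position 41.
authors = [f"person-10007794-ext-{i:07d}" for i in range(1, 151)]
authors[40] = "person-10007794-jb1"
capped = cap_authors(authors, is_local=lambda a: "-ext-" not in a)
```

A rule of this kind keeps every paper linked to at least one internal person while cutting the number of link records per publication from around 150 to 5.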

Implications

There are specific implications for CRISPool and more generic implications for adopting CERIF-XML as an exchange format within the UK.

CRISPool

The project partners are keen to take CRISPool forward from a pilot to a live system. However, first we need to identify a clear achievable objective, such as bringing in people and publications data for all members of SUPA to support decision-making and specific reporting requirements. We then need to develop processes to produce the CERIF-XML data automatically and regularly from the various source databases that exist within member institutions. This is not a small undertaking and requires buy-in from the partner institutions to the bigger picture across the UK research domain: that improved information management and operational efficiency can be gained by adopting CERIF-XML as the exchange format. CRISPool has demonstrated that those with an integrated research system or CRIS are already at an advantage here. Finally, but importantly, we need a CERIF-CRIS to bring the data together with functionality to view, search, report, and so on. Atira have worked as project partners on the pilot but there is no agreement to continue beyond the end of the pilot, and any commercial solution would necessarily need to follow the normal procurement route.

It is also worth noting that at least one other research pool, SICSA18, has already expressed interest in the idea, so a project that was able to provide data for both pools from the common member institutions could be another option; its aim would be to demonstrate the scalability and transferability of using CERIF-XML for this purpose.

CERIF-XML in general

On the technical side, and of relevance to others working with CERIF-XML, the resource-intensive nature of processing CERIF-XML needs to be addressed. This could be via reviewing the CERIF-XML model itself (which runs the risk of reducing CERIF’s ability to model the research information domain accurately) or improving the technology that processes large and/or fragmented XML files. It should be noted that in other applications, especially relating to research information, XML has proved to be an inefficient exchange format. EXEM has been developed to (partially) overcome this (http://portal.acm.org/citation.cfm?id=1285888). However, the article at http://www.criticism.com/dita/dss.html suggests that using XML provides gains over legacy data exchange mechanisms. Considering the spreadsheet exchange method of SUPA hitherto, this appears to be borne out by CRISPool despite the apparent inefficiency of CERIF-XML.

Brigitte Joerg, CERIF Task Group leader, comments:

‘I understand the fragmentation is seen as a problem. But, from my whole experience with ontologies, with respect to interchange is still the most appropriate format - and makes it very flexible to map to - from legacy systems. Especially due to the fact of fragmentation, you can exchange just the data that you need. Imagine an interrelated or networked ontological graph (which can be based on XML too). Here it becomes a problem of where to cut of - and where to locate the related data. I think - the only way to improve fragmentation in CERIF-XML, would be, to define mini-CERIF-Subontologies - like for person, including all the related entities and their basic attributes and also all the relationships for a particular context. That would mean, your CERIF Person Ontology would integrate the related entities - and you could consider such a Person Ontology as your "integration" manager for person records, because it tells you about all the entities you want to involve, and about all the attributes and relationships that come with them - according to your specification. Ontologies try to integrate information based on a real world view - they use URIs for interconnection - but finally they are also XML-based. They do the opposite of fragmentation - here you have to deal with the complexity - but down to the physical level - you still deal with XML.’

For both CRISPool and CERIF-XML in general, further JISC support is recommended, whether by the funding of follow-on project/s or by extending the scope of an existing project such as ERIS (for CRISPool) or R4R (for CERIF-XML in general).

References

Rogers, N and Ferguson, N (2009), Exchanging Research Information in the UK. EXRI-UK: A study funded by JISC. http://ie-repository.jisc.ac.uk/448/1/exri_final_v2.pdf

Price, D and Caddick, S (2010), How to stay on top, Times Higher, 5th Aug 2010. http://www.timeshighereducation.co.uk/story.asp?storycode=412909

Joerg, B, van Grootel, G and Jeffery, K [Eds], CERIF 2008 1-1 XML Data Exchange Format Specification. http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_XML.pdf

Joerg, B et al, CERIF 2008 1-1 Semantics. http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf

Natchetoi, Y, Wu, H, Babin, G and Dagtas, S (2007), EXEM: Efficient XML data exchange management for mobile applications, Information Systems Frontiers, 9, 439-448. http://portal.acm.org/citation.cfm?id=1285888

Hoenisch, S (2005), Using Data Structure Standards to Foster Efficiency and Opportunity. http://www.criticism.com/dita/dss.html

18 www.sicsa.ac.uk

Appendix 1

CRISPool Data Dictionary
Niall Lockhart, Anna Clements

Version 1, 30/04/10
Version 2, 11/05/10 (Nal, akc): a few classschemeIDs and classIDs revised for consistency and to match CRISPool Class Scheme Data.doc. Also blanket changed cfPublicationId to cfResPublId. Some minor corrections to xml files, i.e. missing ‘<’s. To find changes look for 11/05/10.
Version 2.1, 19/05/10 (Akc): Add info and examples for external people in cfPers_CORE, cfPersName-ADD and cfPers_ResPubl-LINK.
Version 2.2, 24/08/10 (Nal): Updated all tables to reflect use of CERIF 2008 V1.1.
Final Version 2.3, 30/08/10 (Akc, Nal): Update at end of project.

Note on IDs

CRISPool is bringing together data from several UK Institutions and will use a combination of the UK Provider Reference Number (UKPRN) plus the Institutional internal ID to ensure uniqueness of IDs within CRISPool for Person and Publication records. UKPRNs can be found at http://www.ukrlp.co.uk. For CRISPool we need:

University of Edinburgh 10007790
Glasgow University 10007794
University of St Andrews 10007803

Akc 19/05/10: For external authors, we have to just assume each one is a separate entity unless the Institution has some kind of db of external authors. Have added examples in the relevant tables below. No need to link such persons to an organisation, but do need to link to publications. Tables affected: cfPers-CORE, cfPersName-ADD, cfPers_ResPubl-LINK.

NAL 24/08/10: It is important to note that all CERIF data contained within the document relates to CERIF 2008 version 1.1 and is correct at time of writing.

NAL 31/08/10: Where elements cfFraction, cfStartDate and cfEndDate are not supplied then default values shall be used. These will be “1” (cfFraction), “1900-01-01T00:00:00.000+01:00” (cfStartDate) and “2099-12-31T00:00:00.000+01:00” (cfEndDate). Also, I have identified 3 tables with a “*” to show that they have not been implemented in this version of CRISPool.
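The ID convention described in this note can be sketched in a few lines of Python. This is an illustration only, not code from the project: the function names are hypothetical, and the 128-character check simply mirrors the field length given for the ID elements in the tables that follow.

```python
# UKPRNs as listed in the note on IDs.
UKPRN = {
    "University of Edinburgh": "10007790",
    "Glasgow University": "10007794",
    "University of St Andrews": "10007803",
}

MAX_ID_LENGTH = 128  # length given for cfPersId / cfResPublId in the data dictionary


def _checked(identifier: str) -> str:
    """Reject IDs that would not fit the declared field length."""
    if len(identifier) > MAX_ID_LENGTH:
        raise ValueError(f"ID longer than {MAX_ID_LENGTH} characters: {identifier}")
    return identifier


def person_id(ukprn: str, inst_id: str, external: bool = False) -> str:
    """person-[UKPRN]-[Person InstID], or person-[UKPRN]-ext-[simple id] for external authors."""
    prefix = "ext-" if external else ""
    return _checked(f"person-{ukprn}-{prefix}{inst_id}")


def publication_id(ukprn: str, inst_publication_id: str) -> str:
    """publication-[UKPRN]-[PublicationID]."""
    return _checked(f"publication-{ukprn}-{inst_publication_id}")


st_andrews = UKPRN["University of St Andrews"]
internal = person_id(st_andrews, "akc")
external = person_id(st_andrews, "0092169", external=True)
paper = publication_id(UKPRN["Glasgow University"], "801001")
```

Prefixing every ID with the UKPRN means two institutions can reuse the same internal identifier without colliding once their files are pooled.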

PERSON

cfPers-CORE
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
    Akc 19/05/10 Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfSex String : 1 char y Examples: “m” “f”
cfURI String : 128 chars Example: “http://dept.physics.gla.ac.uk/staff/default.asp?record=672”
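The lengths in the cfPers-CORE table lend themselves to a simple check before a file is submitted for import. The following is a minimal sketch, assuming each record is held as a plain dictionary keyed by element name; the function name is hypothetical, and treating only cfPersId as required is an assumption made for the example rather than a statement about CERIF or Pure.

```python
# Maximum lengths taken from the cfPers-CORE table.
CFPERS_CORE_MAX_LEN = {"cfPersId": 128, "cfSex": 1, "cfURI": 128}


def check_cfpers_core(record: dict) -> list:
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []
    # Assumption: only cfPersId is treated as required in this sketch.
    if not record.get("cfPersId"):
        problems.append("cfPersId is missing")
    for field, max_len in CFPERS_CORE_MAX_LEN.items():
        value = record.get(field)
        if value is not None and len(value) > max_len:
            problems.append(f"{field} exceeds {max_len} chars")
    return problems


ok = check_cfpers_core({"cfPersId": "person-10007803-akc", "cfSex": "f"})
bad = check_cfpers_core({"cfSex": "female"})
```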

cfPers_Class-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId String : 128 chars y Examples: “internal-person, external-person” “1626”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-person-types” “class-scheme-hesa-identifiers” “class-scheme-wos-identifiers” “class-scheme-supa-identifiers”
cfFraction Float y Examples: “1.0”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
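Link tables such as this one carry cfFraction, cfStartDate and cfEndDate, and the note on IDs states the default values to use when a source system does not supply them. A minimal sketch of applying those defaults, assuming link records are held as dictionaries (the function name is hypothetical):

```python
# Default values as stated in the note dated 31/08/10.
LINK_DEFAULTS = {
    "cfFraction": "1",
    "cfStartDate": "1900-01-01T00:00:00.000+01:00",
    "cfEndDate": "2099-12-31T00:00:00.000+01:00",
}


def with_link_defaults(record: dict) -> dict:
    """Return a copy of a link record with any missing or empty defaultable element filled in."""
    filled = dict(record)
    for element, default in LINK_DEFAULTS.items():
        if not filled.get(element):
            filled[element] = default
    return filled


link = with_link_defaults({
    "cfPersId": "person-10007803-akc",
    "cfClassId": "internal-person",
    "cfClassSchemeId": "class-scheme-person-types",
    "cfStartDate": "2001-01-01T00:00:00",
})
```

Values that are supplied, such as the start date above, are left untouched; only the gaps are filled.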

cfPers_EAddr-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfEAddrId String : 128 chars y Unique email address id: email-[UKPRN]-[Person InstID] e.g. “email-10007803-akc”
cfClassId String : 128 chars y Examples: “email” “skype”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-eaddress-types”
cfFraction Float y Examples: “1.0”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_OrgUnit-LINK
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfOrgUnitId String : 128 chars y Unique organisation unit address id: organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-80UNIV” “organisation-supa-condensed-matter-material-physics” “organisation-supa-nuclear-plasma-physics”
cfClassId String : 128 chars y Examples: “academic”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-job-families”
cfFraction Float y Examples: “1.0”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_PAddr-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPAddrId String : 128 chars y Unique postal address id: paddress-[UKPRN]-[Organisation-InstID] e.g. “paddress-10007803-40SCPHAS”
cfClassId String : 128 chars y Examples: “work”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-paddress-types”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_ResPubl-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
    Akc 19/05/10 Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfResPublId String : 128 chars y Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001”
cfClassId String : 128 chars y Examples: “is-editor-of” “is-author-of”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-cerif-person-publication-roles”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
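A link file such as cfPers_ResPubl-LINK can only be joined up after import if every cfPersId and cfResPublId it mentions also exists in the corresponding core file, which is where the ID inconsistencies reported in the main text arose. The sketch below shows one way to check this before upload using Python's standard library. It matches elements by local name only, and the wrapper elements in the sample strings (`CERIF`, `cfPers`, `cfPers_ResPubl`) are assumptions for illustration rather than a faithful copy of the euroCRIS schema.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def ids_in(xml_text: str, element_name: str) -> set:
    """Collect the text of every element with the given local name."""
    root = ET.fromstring(xml_text)
    return {
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == element_name and el.text and el.text.strip()
    }


def dangling_ids(core_xml: str, link_xml: str, element_name: str) -> list:
    """IDs referenced in the link file that are absent from the core file."""
    return sorted(ids_in(link_xml, element_name) - ids_in(core_xml, element_name))


# Hypothetical sample data; the structure is simplified for the example.
core = """<CERIF>
  <cfPers><cfPersId>person-10007803-akc</cfPersId></cfPers>
</CERIF>"""
links = """<CERIF>
  <cfPers_ResPubl><cfPersId>person-10007803-akc</cfPersId><cfResPublId>publication-10007803-010101</cfResPublId></cfPers_ResPubl>
  <cfPers_ResPubl><cfPersId>person-10007803-xyz</cfPersId><cfResPublId>publication-10007803-010101</cfResPublId></cfPers_ResPubl>
</CERIF>"""
missing = dangling_ids(core, links, "cfPersId")
```

Running the same check with "cfResPublId" against the publication file would cover the other side of the link.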

cfPersKeyW-LANG
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 32 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode String : 5 chars y Examples: “en-GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfKeyW String : 255 chars Examples: “Artificial Intelligence, AI, Human Computer Interfaces” “Physics, Space, Satellite”

*cfPersName_Pers-LINK
31/08/10 – Due to uncertainty of required data this table has not been used in CRISPool
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId1 String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPersId2 String : 128 chars y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId String : 128 chars y Example: “spelling-variant”
cfClassSchemeId String : 128 chars y Examples: “class-scheme-person-name-variants”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfPersNameVar String : 128 chars Unknown Data

cfPersName-ADD
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
    Akc 19/05/10 Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfFamilyNames String : 64 chars y “Clements”
cfOtherNames String : 64 chars
cfFirstNames String : 64 chars y “Anna Katharine”

cfPersResInt-LANG
Element Type Mandatory Content CERIF Pure CERIF Pure
cfPersId String : 128 chars y y Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfResInt NClob Examples: “John Smith's current research subject areas are Artificial Intelligence and Human Computer Interfaces.”

OrganisationUnit

cfOrgUnit-CORE
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation InstID] e.g. “organisation-10007803-40SCPHAS”
cfAccro String : 16 chars Example: “Physics”
cfURI String : 128 chars Example: “http://www.gla.ac.uk/departments/physics/”

cfOrgUnit_Class-LINK
30/08/10 Added
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation InstID] e.g. “organisation-10007803-40SCPHAS”
cfClassId String : 128 chars y Examples: “university” “school” “research-pool” “research-theme”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-organisation-types”
cfFraction Float y Examples: “1.0”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_EAddr-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfEAddrId String : 128 chars y Unique email address id: email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803-40SCPHAS”
cfClassId String : 128 chars y Examples: “email” “skype”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-eaddress-types”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_OrgUnit-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId1 String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfOrgUnitId2 String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfClassId String : 128 chars y Examples: “is-parent-of”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-organisation-relationship-types”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_PAddr-LINK
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfPAddrId String : 128 chars y Unique postal address id: paddress-[UKPRN]-[Organisation-InstID] e.g. “paddress-10007803-40SCPHAS”
cfClassId String : 128 chars y Examples: “work”
cfClassSchemeId String : 128 chars y Scheme: “class-scheme-paddress-types”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_ResPubl-LINK
24/08/10 cfOrgUnitId is actually declared as 32 chars on the euroCRIS website but this is a mistake; RA from Atira has alerted euroCRIS to this error.
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[InstID] e.g. “organisation-10007803-80UNIV”
cfResPublId String : 128 chars y Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001”
cfClassId String : 128 chars y Examples: “is-publisher-of” “is-author-institution-of” “claims-ipr”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-cerif-orgunit-publication-roles”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnitName-LANG
Element Type Mandatory Content CERIF Pure CERIF Pure
cfOrgUnitId String : 128 chars y y Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-80UNIV”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfName String : 255 chars y Examples: “The University of St Andrews”

ResultPublication

cfResPubl-RES

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublDate Date Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfNum String : 30 chars
cfVol String : 30 chars
cfEdition String : 30 chars
cfSeries String : 30 chars
cfIssue String : 30 chars
cfStartPage String : 30 chars
cfEndPage String : 30 chars
cfTotalPages String : 30 chars
cfISBN String : 30 chars
cfISSN String : 30 chars
cfURI String : 128 chars Example: “http://www.st-andrews.ac.uk/departments/physics/book”

cfResPubl_Class-LINK

11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfClassId String : 128 chars y Examples: “textbook” “journal-article”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-cerif-publication-types”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”


cfResPubl_ResPubl-LINK

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId1 String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublId2 String : 128 chars y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010102”
cfClassId String : 128 chars y Examples: “is-part-of”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-cerif-publication-publication-roles”
cfFraction Float y Examples: “1”, “0.5”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfResPublAbstr-LANG

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfAbstr NClob Examples: “An abstract of a publication would be written here.”

cfResPublBiblNote-LANG

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars Examples: “en_GB” “DE”
cfTrans String : 1 chars Examples: “o”
cfBiblNote String : 255 chars Examples: “Additional information on publication up to 255 characters.”

cfResPublKeyW-LANG

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfKeyW String : 255 chars Examples: “Physics, Space, Light, Gravity.”

*cfResPublAbbrev-LANG

Not used within CRISPool as no institution has this data available at this time.

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfAbbrev String : 255 chars Examples: “Abbreviated title of an article.”


*cfResPublSubtitle-LANG

Not used within CRISPool as no institution has this data available at this time.

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfSubtitle String : 255 chars Examples: “Bloggs blogs about blogs”

cfResPublTitle-LANG

11/05/10 correction to xml tags to make valid

Element Type Mandatory Content CERIF Pure CERIF Pure
cfResPublId String : 128 chars y y Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode String : 5 chars y Examples: “en_GB” “DE”
cfTrans String : 1 chars y Examples: “o”
cfTitle String : 255 chars y Examples: “An Example of a Textbook”

Other

cfClassTerm-LANG

11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc

Element Type Mandatory Content CERIF Pure CERIF Pure


cfClassId String : 128 chars y Examples : “academic-teaching” “academic-research”

cfClassSchemeId String : 128 chars y Schemes : “class-scheme-10007803-job-families” “class-scheme-10007994-job-families”

cfLangCode String: 5 chars y Examples: “en_GB” “DE”

cfTrans String :1 chars y Examples: “o”

cfTerm String : 64 chars Examples: “Academic Teaching” “Academic Research”

cfEAddr-2ND

Element Type Mandatory Content CERIF Pure CERIF Pure
cfEAddrId String : 128 chars y Unique email address id: email-[UKPRN]-[Person InstID] {OR} email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803-et37”, “email-10007803-40SCPHAS”
cfPAddrId String : 128 chars y Unique postal address id: paddress-[UKPRN]-[Organisation-InstID] e.g. “paddress-10007803-80UNIV”
cfURI String : 128 chars Examples: “[email protected]”

cfPAddr-2ND

Element Type Mandatory Content CERIF Pure CERIF Pure
cfPAddrId String : 128 chars y Unique postal address id: paddress-[UKPRN]-[Organisation-InstID] e.g. “paddress-10007803-80UNIV”
cfCountryCode String : 2 chars y Examples: “UK” “DE”
cfAddrline1 String : 80 chars
cfAddrline2 String : 80 chars
cfAddrline3 String : 80 chars


cfAddrline4 String : 80 chars
cfAddrline5 String : 80 chars
cfPostCode String : 16 chars
cfCity/Town String : 64 chars
cfStateOfCountry String : 64 chars
cfURI String : 128 chars

cfClassScheme-CLASS

Element Type Mandatory Content CERIF Pure CERIF Pure
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-organisation-types” “class-scheme-cerif-publication-publication-roles”
cfURI String : 128 chars Examples: “/uk/crispool/organisation/types” “/org/eurocris/cerif/publication/publication/roles”

cfClass-CLASS

Element Type Mandatory Content CERIF Pure CERIF Pure
cfClassId String : 128 chars y Examples: “supa-physics-and-life-sciences” “in-book”
cfClassSchemeId String : 128 chars y Schemes: “class-scheme-supa-themes” “class-scheme-cerif-publication-publication-roles”
cfStartDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate Date y Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfURI String : 128 chars Examples: “/uk/crispool/supa/themes/physics-and-life-sciences” “/org/eurocris/cerif/publication/types/in-book”


Appendix 2

Class Scheme Data Niall Lockhart, Anna Clements

Version 1 05/04/10

Nal Version 2.0 02/06/10 Added personal job titles for Glasgow

Akc Version 2.1 30/05/10 Remove cfTerm for wos and hesa

Nal Version 2.2 24/08/10 Updated supa identifiers and schema. Job families also modified. Added paddress types, person types and organisation types.

This document lists the values for each of the class schemes to be used in CRISPool. Those with ‘cerif’ in the title are taken from the documentation on the euroCRIS website; see http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf

class-scheme-eaddress-types

cfClassId cfTerm Link Entity
email Email Address cfOrgUnit_EAddr, cfPers_EAddr
skype Skype Address cfOrgUnit_EAddr, cfPers_EAddr

class-scheme-paddress-types

cfClassId cfTerm Link Entity
work Work Address cfOrgUnit_PAddr, cfPers_PAddr
home Home Address cfOrgUnit_PAddr, cfPers_PAddr

class-scheme-person-types

cfClassId cfTerm Link Entity
external-person External Person cfPers_Class
internal-person Internal Person cfPers_Class

class-scheme-organisation-relationship-types

cfClassId cfTerm Link Entity
is-parent-of Is Parent Of cfOrgUnit_OrgUnit


class-scheme-cerif-orgunit-publication-roles

cfClassId cfTerm Link Entity
is-publisher-of Is Publisher Of cfOrgUnit_ResPubl
claims-ipr Claims IPR Of cfOrgUnit_ResPubl
curator Is Curator Of cfOrgUnit_ResPubl
reviewer Provides Reviewer For cfOrgUnit_ResPubl
is-author-of Is Author Of cfOrgUnit_ResPubl
commissioned Has Commissioned cfOrgUnit_ResPubl
funded Is Funded By cfOrgUnit_ResPubl
author-institution Is Author Institution Of cfOrgUnit_ResPubl
publishing-inst Is Publishing Institution Of cfOrgUnit_ResPubl
external-org Is External Institution Of cfOrgUnit_ResPubl

class-scheme-personal-titles

cfClassId cfTerm Link Entity
mr Mr cfPers_Class
mrs Mrs cfPers_Class
miss Miss cfPers_Class
ms Ms cfPers_Class
dr Dr cfPers_Class
prof Professor cfPers_Class

class-scheme-academic-titles

cfClassId cfTerm Link Entity
mlitt MLitt cfPers_Class
msc MSc cfPers_Class
bsc BSc cfPers_Class
ma MA cfPers_Class
mphil MPhil cfPers_Class
mres MRes cfPers_Class
phd PhD cfPers_Class
meng MEng cfPers_Class
mphys MPhys cfPers_Class
mmath MMath cfPers_Class
beng BEng cfPers_Class
ba BA cfPers_Class
pgdip PGDip cfPers_Class

Akc 30/06/10: remove cfTerm and use cfClassID as the actual data value, as here we are using a CERIF classification scheme purely as a way of augmenting base data for a person, i.e. not as a true classification schema.

class-scheme-hesa-identifiers

cfClassID cfClassId Link Entity
1234567890123 hesa-1234567890123 cfPers_Class
3210987654321 hesa-3210987654321 cfPers_Class

class-scheme-wos-identifiers

cfClassID cfClassId Link Entity
1234-2009 web-of-science-1234-2009 cfPers_Class
9876-2010 web-of-science-9876-2010 cfPers_Class

class-scheme-supa-themes

cfClassId cfTerm Link Entity
main-theme Main Theme cfPers_OrgUnit


additional-theme Additional Theme cfPers_OrgUnit

class-scheme-supa-indentifiers

24/08/10 Nal no need for supaid as already identified as supa through scheme id

cfClassId cfClassId Link Entity
1620 supaid-1620 cfPers_OrgUnit
349 supaid-349 cfPers_OrgUnit

class-scheme-cerif-person-publication-roles

cfClassId cfTerm Link Entity
author Is Author Of cfPers_ResPubl
editor Is Editor Of cfPers_ResPubl
author-numbered Is Author (Numbered) Of cfPers_ResPubl
author-percentage Is Author (Percentage) Of cfPers_ResPubl
subject Is Subject Of cfPers_ResPubl
commissioned Has Commissioned cfPers_ResPubl
reviewer Is Reviewer Of cfPers_ResPubl
translator Is Translator Of cfPers_ResPubl
publisher Is Publisher Of cfPers_ResPubl
commissioned Has Commissioned cfPers_ResPubl

Akc 11/05/10 IGNORE class-scheme-person-name-variants

cfClassId cfTerm Link Entity
spelling-variant Spelling Variant of Person’s Name cfPersName_Pers

class-scheme-cerif-publication-types

cfClassId cfTerm Link Entity
book Book cfResPubl_Class
book-review Book Review cfResPubl_Class
book-chapter-abstract Book Chapter Abstract cfResPubl_Class
book-chapter-review Book Chapter Review cfResPubl_Class
in-book In Book cfResPubl_Class
anthology Anthology cfResPubl_Class
monograph Monograph cfResPubl_Class
reference-book Reference book cfResPubl_Class
textbook Textbook cfResPubl_Class
encyclopaedia Encyclopaedia cfResPubl_Class
manual Manual cfResPubl_Class
other-book Other Book cfResPubl_Class
journal Journal cfResPubl_Class
journal-article Journal Article cfResPubl_Class
journal-article-abstract Journal Article Abstract cfResPubl_Class
journal-article-review Journal Article Review cfResPubl_Class
conference-proceedings Conference Proceedings cfResPubl_Class
conference-proceedings-article Conference Proceedings Article cfResPubl_Class
letter Letter cfResPubl_Class
letter-to-editor Letter To Editor cfResPubl_Class
phd-thesis PhD Thesis cfResPubl_Class
doctoral-thesis Doctoral Thesis cfResPubl_Class
report Report cfResPubl_Class
short-communication Short Communication cfResPubl_Class
poster Poster cfResPubl_Class
presentation Presentation cfResPubl_Class
news-clipping News Clipping cfResPubl_Class
commentary Commentary cfResPubl_Class


annotation Annotation cfResPubl_Class

class-scheme-cerif-publication-publication-roles

cfClassId cfTerm Link Entity
is-part-of Is Part Of cfResPubl_ResPubl

Job Titles etc

This area needs to be flexible to cope with the different ways different Institutions categorise their staff. Suggest up to three levels as follows:

1. Personal Job Title: what I want to be known as and what should be shown on portal e.g. ‘Professor of Photonics’
2. Generic Job Title: for filtering and grouping by SUPA e.g. ‘Professor’
3. Job Family: for filtering and grouping by SUPA e.g. Academic

class-scheme-job-families

24/08/10 Nal Currently these are the only available jobs and schema within CRISPool

cfClassId cfTerm Link Entity
academic Academic cfPers_OrgUnit
academic-research Academic Research cfPers_OrgUnit
academic-teaching Academic Teaching cfPers_OrgUnit
honorary Honorary cfPers_OrgUnit
emeritus Emeritus cfPers_OrgUnit
research-support Research Support cfPers_OrgUnit

At St Andrews we can only supply 1 and 3 at the moment.

St Andrews

class-scheme-10007803-personal-job-titles : EXAMPLES as one created per link

cfClassId cfTerm Link Entity
professor-photonics Professor of Photonics cfPers_OrgUnit
honorary-professor Honorary Professor cfPers_OrgUnit
supa-advanced-fellow SUPA Advanced Fellow cfPers_OrgUnit
research-assistant Research Assistant cfPers_OrgUnit
pic-technical-manager PIC Technical Manager cfPers_OrgUnit

class-scheme-10007803-job-families

cfClassId cfTerm Link Entity
academic Academic cfPers_OrgUnit
academic-research Academic Research cfPers_OrgUnit
academic-teaching Academic Teaching cfPers_OrgUnit
honorary Honorary cfPers_OrgUnit
emeritus Emeritus cfPers_OrgUnit
research-support Research Support cfPers_OrgUnit

Glasgow

class-scheme-10007794-personal-job-titles – added 02/06/10 NL

cfClassId cfTerm Link Entity


senior-research-fellow Senior Research Fellow cfPers_OrgUnit
research-fellow-knc-manager Research Fellow/KNC Manager cfPers_OrgUnit
professor Professor cfPers_OrgUnit
rcuk-research-fellow RCUK Research Fellow cfPers_OrgUnit
professor-of-physics Professor of Physics cfPers_OrgUnit
research-fellow Research Fellow cfPers_OrgUnit
regius-professor-of-astronomy-astronomer-royal-for-scotland Regius Professor of Astronomy (Astronomer Royal for Scotland) cfPers_OrgUnit
reader Reader cfPers_OrgUnit
kelvin-chair-of-natural-philosophy Kelvin Chair of Natural Philosophy cfPers_OrgUnit
senior-lecturer Senior Lecturer cfPers_OrgUnit
reader-in-astrophysics Reader in Astrophysics cfPers_OrgUnit
lecturer Lecturer cfPers_OrgUnit
egee-scotgrid-technical-coordinator EGEE/ScotGrid Technical Co-ordinator cfPers_OrgUnit
professor-cargill-chair-of-natural-philosophy Professor - Cargill Chair of Natural Philosophy cfPers_OrgUnit
research-fellow-atlas-neural-net-analysis Research Fellow ATLAS Neural Net Analysis cfPers_OrgUnit

Yellow highlights – may not be needed as historical records

class-scheme-10007794-generic-job-titles

cfClassId cfTerm Link Entity
administrative-library-and-computing1 Administrative Library & Computing 1 cfPers_OrgUnit
administrative-library-and-computing2 Administrative Library & Computing 2 cfPers_OrgUnit
advisor-of-studies Advisor of Studies cfPers_OrgUnit
atypical-worker Atypical Worker cfPers_OrgUnit
atypical-worker-grade5 Atypical Worker Grade 5 cfPers_OrgUnit
atypical-worker-minimum-wage Atypical Worker Minimum Wage cfPers_OrgUnit
head-of-department Head Of Department cfPers_OrgUnit
honorary-staff Honorary Staff cfPers_OrgUnit
mpa-level4 MPA Level 4 cfPers_OrgUnit
mpa-level5 MPA Level 5 cfPers_OrgUnit
mpa-level6 MPA Level 6 cfPers_OrgUnit
mpa-level7 MPA Level 7 cfPers_OrgUnit
mpa-level8 MPA Level 8 cfPers_OrgUnit
marie-curie-fellow Marie Curie Fellow cfPers_OrgUnit


operational2 Operational 2 cfPers_OrgUnit

professor Professor cfPers_OrgUnit

reader Reader cfPers_OrgUnit

research1a Research 1A cfPers_OrgUnit

research1b Research 1B cfPers_OrgUnit

research2 Research 2 cfPers_OrgUnit

research-and-teaching6 Research & Teaching 6 cfPers_OrgUnit

research-and-teaching7 Research & Teaching 7 cfPers_OrgUnit

research-and-teaching8 Research & Teaching 8 cfPers_OrgUnit

research-and-teaching9 Research & Teaching 9 cfPers_OrgUnit

scholar Scholar cfPers_OrgUnit

scholarship Scholarship cfPers_OrgUnit

senior-lecturer Senior Lecturer cfPers_OrgUnit

technical2 Technical 2 cfPers_OrgUnit

technical4 Technical 4 cfPers_OrgUnit

technical5 Technical 5 cfPers_OrgUnit

technical6 Technical 6 cfPers_OrgUnit

technical7 Technical 7 cfPers_OrgUnit

technician-a Technician A cfPers_OrgUnit

technician-c Technician C cfPers_OrgUnit

technician-d Technician D cfPers_OrgUnit

technician-e Technician E cfPers_OrgUnit

technician-f Technician F cfPers_OrgUnit

class-scheme-10007794-job-families

cfClassId cfTerm Link Entity
admin-library-computing Administrative Library and Computing cfPers_OrgUnit
academic-related Academic and Related cfPers_OrgUnit
atypical Atypical Workers cfPers_OrgUnit
honorary Honorary University cfPers_OrgUnit
mpa Management Professional and Administrative cfPers_OrgUnit
operational Operational cfPers_OrgUnit
academic Academic cfPers_OrgUnit
research Research cfPers_OrgUnit
research-and-teaching Research and Teaching cfPers_OrgUnit
scholars Scholars cfPers_OrgUnit
technical Technical and Related cfPers_OrgUnit

Edinburgh – tbc – need job-families and personal-job-titles

class-scheme-10007790-job-families

cfClassId cfTerm Link Entity

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

cfPers_OrgUnit

SUPA-tbc

Suggestion here is for SUPA to provide class-scheme-supa-job-families to which member Institutions can map their own job families.

class-scheme-organisation-types

cfClassId cfTerm Link Entity
department Department cfOrgUnit_Class
university University cfOrgUnit_Class
school School cfOrgUnit_Class
college College cfOrgUnit_Class
research-pool Research Pool cfOrgUnit_Class
research-theme Research Theme cfOrgUnit_Class


Appendix 3

CRISPool CERIF to PURE4 mapping

• 1 Important notes
  o 1.1 CERIF imposes constraints on data
  o 1.2 Persons/Authors
  o 1.3 Fragmentation
  o 1.4 Translations
• 2 Classification mappings
  o 2.1 Organisations
  o 2.2 SUPA Themes
  o 2.3 Persons
    2.3.1 Employment types
  o 2.4 Publications
    2.4.1 Publication Peer Review
    2.4.2 Organisation to publication relations
  o 2.5 Electronic addresses
• 3 Entity Mappings
  o 3.1 Organisation
  o 3.2 Person
    3.2.1 Person-organisation relationships
  o 3.3 Address (UK)
  o 3.4 Email, Skype, etc.
  o 3.5 Publications
    3.5.1 General fields
    3.5.2 Contribution to Journal
    3.5.3 Book Anthology
    3.5.4 Conference Contribution
    3.5.5 Contribution to Book Anthology
    3.5.6 Other Contribution
    3.5.7 Working Paper

Important notes

CERIF imposes constraints on data

The CERIF XML format imposes many constraints on the data it holds. Many text strings in the format are limited by a maximum length constraint, and often this limit is too small. Imposing such restrictions on data is not suitable for an exchange format, which is what CERIF actually is. If a receiving CRIS system has length constraints on text strings, the problem should be dealt with internally. This has been reported to euroCRIS for their information.
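As an illustration of how a data provider might detect values that would break these limits before export, the sketch below checks a record against a few of the lengths listed in Appendix 1. The record shape (a plain dictionary) and the function name `overlong_fields` are assumptions made for the example.

```python
# A few maximum lengths taken from the element tables in Appendix 1.
FIELD_LIMITS = {
    "cfTitle": 255,
    "cfBiblNote": 255,
    "cfKeyW": 255,
    "cfTerm": 64,
}


def overlong_fields(record: dict) -> dict:
    """Return {field name: actual length} for every value exceeding its limit."""
    return {
        name: len(value)
        for name, value in record.items()
        if name in FIELD_LIMITS and len(value) > FIELD_LIMITS[name]
    }


problems = overlong_fields({"cfTitle": "x" * 300, "cfTerm": "Academic Research"})
```

Reporting the offending fields, rather than silently truncating them, leaves the decision about how to shorten the text with the institution that owns the data.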

Persons/Authors

Only "real" persons are exported/imported, meaning that when a person is connected to a publication the alias author name, which can be different from the person's actual name, is discarded.


Fragmentation

Generally the CERIF XML data model is highly fragmented. Data regarding an entity is scattered across several XML files and namespaces, which means that referential integrity is not enforced by the format. It is therefore up to the data provider to ensure that references are correct.
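A data provider could check references before export along the following lines. This is only a sketch: it assumes the ids have already been read from the separate XML files into Python collections, and the function name `dangling_references` is hypothetical.

```python
def dangling_references(org_ids, publ_ids, links):
    """Return the (cfOrgUnitId, cfResPublId) links that point at a missing entity.

    org_ids and publ_ids are the ids found in the entity files; links are the
    id pairs found in the cfOrgUnit_ResPubl link file.
    """
    known_orgs, known_publs = set(org_ids), set(publ_ids)
    return [
        (org_id, publ_id)
        for org_id, publ_id in links
        if org_id not in known_orgs or publ_id not in known_publs
    ]


broken = dangling_references(
    org_ids=["organisation-10007803-80UNIV"],
    publ_ids=["publication-10007803-010101"],
    links=[
        ("organisation-10007803-80UNIV", "publication-10007803-010101"),
        ("organisation-10007803-80UNIV", "publication-10007803-010102"),
    ],
)
```

The same check applies to any other link entity by swapping in the relevant pair of id sets.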

Translations

CERIF uses language codes for specifying languages and translations, but the only specification available is that language codes are 5 characters long. In this project we use the well-known convention <language code>_<country code>, where

• language code is the two-letter ISO 639-1 code (listed alongside the ISO 639-2 codes at http://www.loc.gov/standards/iso639-2/englangn.html)

• country code is the two-letter ISO 3166 code (see http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html)

Examples are: en_GB (British English), en_US (American English), da_DK (Danish), fr_FR (French from France).
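The convention can be sketched as a small builder and a shape check. The names below are illustrative assumptions, and the check only tests the pattern of the code; it does not verify that the parts are registered ISO codes.

```python
import re

# Two lower-case letters, an underscore, two upper-case letters (5 characters).
LANG_CODE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_lang_code(code: str) -> bool:
    """True if code has the <language code>_<country code> shape."""
    return LANG_CODE_PATTERN.fullmatch(code) is not None


def make_lang_code(language: str, country: str) -> str:
    """Join a two-letter language code and a two-letter country code."""
    code = f"{language.lower()}_{country.upper()}"
    if not is_lang_code(code):
        raise ValueError(f"not a valid language code: {code!r}")
    return code


british_english = make_lang_code("en", "gb")
```

A bare value such as "DE" does not pass this check, since it lacks the language part.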

Classification mappings

Classifications are mapped from CERIF classification id and scheme id to either a PURE4 classification URI or a contextual meaning.

Organisations

An organisation is classified by an organisation type. In CERIF organisations are classified via the cfOrgUnit_Class classification element. CERIF Scheme id: class-scheme-organisation-types

Cerif cfClassId PURE Classification URI
university /dk/atira/pure/organisation/organisationtypes/organisation/university
college /dk/atira/pure/organisation/organisationtypes/organisation/college
faculty /dk/atira/pure/organisation/organisationtypes/organisation/faculty
school /dk/atira/pure/organisation/organisationtypes/organisation/school
department /dk/atira/pure/organisation/organisationtypes/organisation/department
institute /dk/atira/pure/organisation/organisationtypes/organisation/institue
research-pool /dk/atira/pure/organisation/organisationtypes/organisation/research
research-theme /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
publisher /dk/atira/pure/publisher/publishertypes/publisher/publisher
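In code, a mapping of this kind amounts to a lookup table. The sketch below covers only a subset of the rows above and is not how PURE itself implements the mapping; the names `ORG_TYPE_TO_PURE` and `pure_org_type` are assumptions made for the example.

```python
from typing import Optional

ORG_TYPE_BASE = "/dk/atira/pure/organisation/organisationtypes/organisation/"

# Subset of the mapping table above: CERIF cfClassId -> PURE classification URI.
ORG_TYPE_TO_PURE = {
    "university": ORG_TYPE_BASE + "university",
    "college": ORG_TYPE_BASE + "college",
    "faculty": ORG_TYPE_BASE + "faculty",
    "school": ORG_TYPE_BASE + "school",
    "department": ORG_TYPE_BASE + "department",
}


def pure_org_type(cf_class_id: str) -> Optional[str]:
    """Return the PURE URI for a CERIF organisation type, or None if unmapped."""
    return ORG_TYPE_TO_PURE.get(cf_class_id)


school_uri = pure_org_type("school")
```

Returning None for an unmapped id lets the caller decide whether to skip the record or report it back to the data provider.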

The organisation relationship scheme classifies an organisation to organisation relation and is specified in the CERIF cfOrgUnit_OrgUnit link element. Scheme id: class-scheme-organisation-relationship-types

Cerif cfClassId Contextual meaning
is-parent-of cfOrgUnitId1 is parent of cfOrgUnitId2 (cfOrgUnit_OrgUnit)

SUPA Themes

SUPA Themes are mapped to organisations in the Research Theme classification. CERIF Scheme id: class-scheme-supa-themes

Cerif cfClassId PURE Classification URI
supa-particle-physics /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-astronomy-space-physics /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-condensed-matter-material-physics /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-physics-life-sciences /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-energy /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-nuclear-plasma-physics /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-photonics /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme

Persons

Both internal and external authors are mapped to the CERIF cfPers type and distinguished from each other by classification via the cfPers_Class element. Scheme id: class-scheme-person-types (cfPers_Class)

Cerif cfClassId Contextual meaning
internal-person the person is mapped to a PURE Person (and PURE authors)
external-person the person is mapped to a PURE External Person Author (only present on publications etc.)

Employment types

A person's relation to an organisation is classified by an employment type. This is expressed in CERIF via the classification present in the cfPers_OrgUnit link element. SchemeId: class-scheme-job-families

Cerif cfClassId PURE Classification URI
academic /dk/atira/pure/person/employmenttypes/academic
academic-research /dk/atira/pure/person/employmenttypes/academicresearch
academic-teaching /dk/atira/pure/person/employmenttypes/academicteaching
honorary /dk/atira/pure/person/employmenttypes/honorary
emeritus /dk/atira/pure/person/employmenttypes/emeritus
research-support /dk/atira/pure/person/employmenttypes/research-support

Publications

Scheme id: class-scheme-cerif-publication-types

Cerif cfClassId PURE Classification URI
book /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
book-review /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
book-chapter-abstract /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/foreword
book-chapter-review /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
in-book /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
anthology /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
monograph /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/special
reference-book /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
textbook /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
encyclopaedia /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
manual /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
other-book /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
journal /dk/atira/pure/journal/journaltypes/journal/journal
journal-article /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/article
journal-article-abstract /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
journal-article-review /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/scientific
conference-proceedings /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
conference-proceedings-article /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/paper
letter /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
letter-to-editor /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
phd-thesis /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
doctoral-thesis /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
report /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/commissioned
short-communication /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
poster /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/poster
presentation /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
news-clipping /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
commentary /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment
annotation /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment
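Several CERIF publication types share one PURE classification, so the mapping cannot be reversed without losing detail. The sketch below, using a subset of the rows above and hypothetical names, groups CERIF ids by their PURE URI to make the collapsed types visible.

```python
from collections import defaultdict

PREFIX = "/dk/atira/pure/researchoutput/researchoutputtypes/"

# Subset of the mapping table above: CERIF cfClassId -> PURE classification URI.
CERIF_TO_PURE = {
    "book": PREFIX + "bookanthology/book",
    "reference-book": PREFIX + "bookanthology/book",
    "textbook": PREFIX + "bookanthology/book",
    "journal-article": PREFIX + "contributiontojournal/article",
    "letter": PREFIX + "contributiontojournal/letter",
    "letter-to-editor": PREFIX + "contributiontojournal/letter",
}


def cerif_types_by_pure_uri(mapping: dict) -> dict:
    """Group CERIF class ids by the PURE URI they are mapped to."""
    grouped = defaultdict(list)
    for cf_class_id, uri in mapping.items():
        grouped[uri].append(cf_class_id)
    return dict(grouped)


grouped = cerif_types_by_pure_uri(CERIF_TO_PURE)
# PURE URIs that more than one CERIF type is mapped to.
collapsed = {uri: ids for uri, ids in grouped.items() if len(ids) > 1}
```

In this subset a textbook and a reference book both arrive in PURE as a plain book, so a round trip back to CERIF could not tell them apart.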


Publication Peer Review

Whether a publication has been peer reviewed is signalled using the cfResPubl_Class element. Scheme id: class-scheme-publication-peer-review

Cerif cfClassId Contextual meaning
is-reviewed The publication has been reviewed by a peer
is-not-reviewed The publication has not been reviewed by a peer

Organisation to publication relations

Scheme id: class-scheme-cerif-orgunit-publication-roles

Cerif cfClassId Contextual meaning
claims-ipr the organisation is considered the owner of the publication
author-institution the organisation has an author on the publication
is-author-of the organisation has an author on the publication
is-publisher-of for future use
publishing-inst for future use
curator for future use
reviewer for future use
commissioned for future use
funded for future use
external-org for future use

Electronic addresses

The electronic address classification is used to identify different types of addresses and is specified in the cfPers_EAddr element. Scheme id: class-scheme-eaddress-types

Cerif cfClassId Context Type
email Email address (cfEAddr)
skype Skype address (cfEAddr)
messenger Instant Messaging
web Web site URL
phone Phone number
mobile Mobile phone number
fax Fax number

Entity Mappings

Organisation

PURE field CERIF field PURE Mandatory CERIF Mandatory Default value
name cfOrgUnitName.cfName Y
shortName cfOrgUnit.cfAcro
period.start cfOrgUnit_Class.cfStartDate (the first appearance) Y Y today
period.end cfOrgUnit_Class.cfEndDate (the first appearance) Y
type cfOrgUnit_Class (the first appearance) Y Y
visibility NA Y FREE
keywords cfOrgUnitKeyw.cfKeyw
website cfOrgUnit.cfURI
email cfEAddr (first appearance classified as email)
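The "first appearance" rule and the default values in this table can be sketched as follows. This is an illustration under stated assumptions only: records are plain dictionaries, the function name `map_organisation` is hypothetical, and the type is left as the CERIF cfClassId rather than being translated to a PURE classification URI.

```python
from datetime import date


def map_organisation(name, org_classes, today=None):
    """Map CERIF organisation data to a PURE-like dictionary (hypothetical shape).

    org_classes holds the cfOrgUnit_Class records in document order.
    """
    first = org_classes[0]  # only the first appearance is used
    # Default value "today" applies when no start date is supplied.
    start = first.get("cfStartDate") or (today or date.today()).isoformat()
    return {
        "name": name,
        "type": first["cfClassId"],
        "period.start": start,
        "period.end": first.get("cfEndDate"),
        "visibility": "FREE",  # no CERIF source field, so the default is used
    }


organisation = map_organisation(
    "The University of St Andrews",
    [{"cfClassId": "university"}, {"cfClassId": "school"}],
    today=date(2010, 9, 16),
)
```

Passing `today` explicitly keeps the example repeatable; in normal use it would be left out.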

Person

The UK model has two different person-organisation relation types, which differ depending on whether the person is staff or a student. In this proof of concept project, we assume that only staff are synchronised.

PURE field CERIF field PURE Mandatory CERIF Mandatory Default value
name.firstname cfPersName-ADD.cfFirstNames (first appearance) Y
name.lastName cfPersName-ADD.cfLastNames (first appearance) Y
- cfPersName-ADD.cfOtherNames
nameVariants cfPersName-ADD (2nd to last appearance)
sex cfPers.cfSex Y

Person-organisation relationships

In CERIF a number of postal addresses, email addresses, etc. are associated directly with the person. In PURE these relations are gathered as metadata on a person-organisation relation, and a person can have one or more such relations. To overcome this mismatch, the CRISPool CERIF mapping bends the rules by using the classification of the cfPers_EAddr and cfPers_PAddr relations to carry data. The following special classification schemes have therefore been defined. Common to all of them is that the cfClassId contains the cfOrgUnitId of the related organisation.

• cfPers_PAddr o class-scheme-person-organisation-address-postal specifies a person's

postal work address in relation to an organisation • cfPers_EAddr

o class-scheme-person-organisation-address-email specifies a person's email work address in relation to an organisation

o class-scheme-person-organisation-address-web specifies a person's web work address in relation to an organisation

o class-scheme-person-organisation-address-phone specifies a person's work phone in relation to an organisation


o class-scheme-person-organisation-address-mphone specifies a person's work mobile phone in relation to an organisation

o class-scheme-person-organisation-address-fax specifies a person's work fax number in relation to an organisation
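On import, the classification schemes listed above let the per-organisation grouping be rebuilt from the flat CERIF links. The following is a minimal sketch under stated assumptions: the link records are plain dictionaries, and the keys `cfClassSchemeId` and `address_id` as well as the function name are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

SCHEME_PREFIX = "class-scheme-person-organisation-address-"


def group_addresses_by_organisation(links):
    """Group a person's address links by the organisation they belong to.

    Each link is assumed to be a dict with the (hypothetical) keys
    "cfClassSchemeId", "cfClassId" and "address_id". Following the mapping,
    cfClassId carries the cfOrgUnitId of the related organisation.
    """
    grouped = defaultdict(dict)
    for link in links:
        scheme = link["cfClassSchemeId"]
        if not scheme.startswith(SCHEME_PREFIX):
            continue  # not one of the special person-organisation schemes
        kind = scheme[len(SCHEME_PREFIX):]  # e.g. "email", "postal", "phone"
        org_id = link["cfClassId"]
        grouped[org_id].setdefault(kind, []).append(link["address_id"])
    return dict(grouped)
```

Each key of the result corresponds to one person-organisation relation in PURE, with the addresses that belong to it.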

A person's relation to an organisation is classified by an employment type in the class-scheme-job-families scheme. This is expressed in CERIF via the classification present in a cfPers_OrgUnit link element.

Address (UK)

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
postalCode   cfPAddr.cfPostCode
country   cfPAddr.cfCountryCode   Y
address1   cfPAddr.cfAddrline1
address2   cfPAddr.cfAddrline2
address3   cfPAddr.cfAddrline3
address4   cfPAddr.cfAddrline4
address5   cfPAddr.cfAddrline5

Email, Skype, etc.

CERIF electronic addresses such as email, Skype and messenger are specified via a cfEAddr. The different electronic addresses are distinguished from each other by their classification, as specified earlier in the document. The actual electronic address is specified in the cfURI element.

Publications

General fields

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
publishedDate, publicationYear, -Month, -Day   cfResPubl.cfResPublDate   Y
numberOfPages   cfResPubl.cfTotalPages
title (localised)   cfResPublTitle.cfTitle   Y
abstract (localised)   cfResPublAbstr.cfAbstr
bibliographicalNote   cfResPublBiblNote.cfBiblNote
keywords   cfResPublKeyw.cfKeyw

Contribution to Journal

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value


pages   cfResPubl.cfStartPage, cfResPubl.cfEndPage
journalNumber   cfResPubl.cfNum
volume   cfResPubl.cfVol

Book Anthology

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
printIsbns   cfResPubl.cfISBN
edition   cfResPubl.cfEdition
volume   cfResPubl.cfVol

Conference Contribution

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
pages   cfResPubl.cfStartPage, cfResPubl.cfEndPage
peerReview   cfResPubl_Class (peer review classification)

Contribution to Book Anthology

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
printIsbns   cfResPubl.cfISBN
edition   cfResPubl.cfEdition
hostPublicationTitle   cfResPubl.cfSeries

Other Contribution

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
printIsbns   cfResPubl.cfISBN

Working Paper

PURE field   CERIF field   PURE Mandatory   CERIF Mandatory   Default value
printIsbns   cfResPubl.cfISBN


Appendix 4

Technical Summary - CRISPool project prototype implementation

Created by: Atira A/S, edited by Thomas Vestdam

Date: 87 February 2011

Version: 1.0

Rev. nr. 19

Technical Summary

Below we have outlined how the CERIF-XML import and export functionality was implemented in Pure for the CRISPool project.

In general, we have observed a few, but important, problems when using CERIF-XML as an exchange format:

Fragmentation – introduces unnecessary complexity, especially in import algorithms, as the input must be scanned several times in order to collect all relevant XML-fragments that make up a single entity (e.g. a person). The excessive scanning of XML input also causes performance issues. Suggestion: allow certain XML entities/types to include other relevant entities, resulting in a single comprehensive document type covering everything related to that type. E.g. the person element could allow inclusion of optional sub-elements such as names, keywords, relations to other entities, etc. That is, the CERIF model is kept as it is, but the exchange format becomes better suited to machine processing and easier for humans to read.

Too many namespaces – parsing and querying (e.g. using XPath) CERIF-XML is very cumbersome as every single element has its own namespace. Suggestion: only have one namespace per CERIF version. This would also allow having only one XML schema defining CERIF-XML.

Constraints – the XML format should not impose too many constraints on data sizes other than IDs. It makes good sense to keep an upper limit on ID lengths, but we suggest that names, titles, abstracts, etc. should be unbounded, leaving it up to the individual CRIS systems to decide what to do if the incoming data is longer than the system's internal representation allows.
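To illustrate the namespace problem described above, the sketch below queries a small document in which each entity type lives in its own namespace. The namespace URIs, the wrapper element and the sample values are hypothetical and chosen only for illustration; they are not the real CERIF-XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, for illustration only; not the real CERIF ones.
NS = {
    "pers": "urn:example:cerif:cfPers",
    "name": "urn:example:cerif:cfPersName",
}

SAMPLE = """
<export>
  <cfPers xmlns="urn:example:cerif:cfPers">
    <cfPersId>p1</cfPersId>
  </cfPers>
  <cfPersName xmlns="urn:example:cerif:cfPersName">
    <cfPersId>p1</cfPersId>
    <cfFirstNames>Ada</cfFirstNames>
  </cfPersName>
</export>
""".strip()

root = ET.fromstring(SAMPLE)

# An unprefixed path matches nothing, because the elements are namespaced.
unqualified = root.findall("cfPers")

# Every entity type needs its own prefix, repeated at each step of the path.
person_ids = [e.text for e in root.findall("pers:cfPers/pers:cfPersId", NS)]
first_names = [e.text for e in root.findall("name:cfPersName/name:cfFirstNames", NS)]
```

With a single namespace per CERIF version, as suggested, one prefix would serve every query.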

Most of the issues seem to stem from the fact that the CERIF-XML format is very close to the relational database schema defined by the CERIF model. This leaves some desired improvements to CERIF-XML as an exchange format. However, the bottom line is that CERIF (XML) can, as such, be utilised as a flexible exchange format.


Outline of the CERIF-XML import functionality

Importing CERIF-XML into Pure is done by first loading the supplied CERIF-XML files for a given data-provider into an XML-database (eXist-db was used, http://exist.sourceforge.net/). Content is then either created or updated in Pure based on the information in the XML-database. If a given piece of content (e.g. person, organisation, research output) does not already exist in Pure, the content is created and stored along with the id found in the input (e.g. cfPersId) as a source id. It is this source id that is used to check for existence in Pure.
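The create-or-update decision based on the source id can be sketched as follows. This is a simplified illustration, not Pure's actual code: a plain dictionary keyed by source id stands in for Pure's content store, and the function name is hypothetical.

```python
def import_entity(store, source_id, data):
    """Create or update one piece of content, keyed by its source id.

    store is a dict standing in for the target system's content store
    (an assumption made for this sketch); source_id is the id found in
    the input, e.g. a cfPersId.
    """
    if source_id in store:
        # The source id is already known: update the existing content.
        store[source_id].update(data)
        return "updated"
    # Unknown source id: create the content and remember the id.
    store[source_id] = dict(data)
    return "created"
```

Running the same import twice therefore updates content rather than duplicating it.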

When creating or updating content, all relevant bits and pieces are loaded from the XML-database. For a given person, for example, that would be the XML-fragments relating to the specific person id, such as:

• the person element

• person name elements

• associated keyword elements

• person-organisation relation elements

• and so on

The relevant XML-fragments are found by performing several XPath queries in the XML-database in order to provide a “single document” containing all XML-fragments for a given piece of content. The XML-fragments are then transferred to relevant entities in the Pure model (we use XMLBeans to create bindings to Java types, http://xmlbeans.apache.org/).
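The idea of gathering all fragments for one person into a "single document" can be sketched as below. This is not the eXist-db and XMLBeans implementation described above: it uses Python's standard library on in-memory strings, omits namespaces for brevity, and the wrapper element names and sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three small stand-ins for separate CERIF-XML files (hypothetical content).
DOCS = {
    "persons": "<CERIF><cfPers><cfPersId>p1</cfPersId></cfPers>"
               "<cfPers><cfPersId>p2</cfPersId></cfPers></CERIF>",
    "names": "<CERIF><cfPersName><cfPersId>p1</cfPersId>"
             "<cfLastNames>Lovelace</cfLastNames></cfPersName></CERIF>",
    "org_links": "<CERIF><cfPers_OrgUnit><cfPersId>p1</cfPersId>"
                 "<cfOrgUnitId>o1</cfOrgUnitId></cfPers_OrgUnit></CERIF>",
}


def collect_fragments(docs, person_id):
    """Return one element holding every fragment that refers to person_id."""
    combined = ET.Element("person-fragments")
    for xml_text in docs.values():
        root = ET.fromstring(xml_text)
        # One query per input file: top-level fragments whose cfPersId matches.
        for fragment in root.findall(f"*[cfPersId='{person_id}']"):
            combined.append(fragment)
    return combined
```

Every input file has to be queried once per entity, which is the repeated scanning noted under "Fragmentation" in the Technical Summary.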

The XML-database approach was chosen over a handwritten parser for the sake of simplicity, and to provide a better basis for handling very large data-sets (if needed, the XML-database can be kept in memory or be streamed to the file-system).

Outline of the CERIF-XML export functionality

Exporting content from Pure is implemented using the export framework in Pure by defining a series of “converters”. Each converter is responsible for converting a given Pure model entity (e.g. Person, Organisation, Research Output, Journal, Patent) to CERIF-XML, creating all relevant CERIF-XML fragments that represent that entity. E.g. for a person that would be fragments such as:

• the CERIF person element, in a CERIF person elements XML file

• the person's name, in a CERIF person name elements XML file

• keywords associated with a person, in a CERIF person keyword elements XML file

• relations to organisations, in a CERIF person organisation elements XML file

• and so on


When exporting, the relevant data is loaded based on a list of the organisations that data is needed for: each research output associated with any of the input organisations or their sub-organisations is loaded and converted one by one. For each research output, associated organisations, authors (persons) and journals are loaded and converted. In turn, when converting a person, any associated organisations are converted, and when converting an organisation any associated organisations (e.g. sub-organisations) are converted as well. Due to the recursive nature of the order in which entities are loaded and converted, the exporter keeps track of already converted entities by keeping a list of their UUIDs. This ensures that entities are only loaded and converted once.
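The bookkeeping described above can be sketched as a recursive traversal that records which entities have been handled. This is an illustration under assumptions, not the Pure exporter: the relation graph is a plain dictionary, the converter is any callable, and a set is used where the text speaks of a list of UUIDs.

```python
def export(entity_id, relations, convert, converted=None):
    """Convert an entity and, recursively, everything related to it.

    relations maps an entity id to the ids of its associated entities
    (hypothetical structure); convert is called once per entity.
    """
    if converted is None:
        converted = set()
    if entity_id in converted:
        return converted  # already converted, so skip it (also breaks cycles)
    converted.add(entity_id)
    convert(entity_id)
    for related_id in relations.get(entity_id, []):
        export(related_id, relations, convert, converted)
    return converted
```

Without the `converted` check, an organisation and its sub-organisation that reference each other would be converted endlessly.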

The actual XML files are generated when all relevant entities have been converted, by serialising all XML fragments into a set of CERIF-XML files. In this specific implementation an XML-database was used to temporarily store the XML-fragments while converting data, and when the conversion was done, the final XML-files were created by utilising the serialising capabilities of the XML-database. The CERIF-XML export is provided as a special web-service that delivers the serialised CERIF-XML files as a bundled zip-file.
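The final bundling step might look roughly like the sketch below, which packs already serialised XML documents into an in-memory zip archive. The file names and the function name are hypothetical, and the web-service layer that would deliver the bytes is left out.

```python
import io
import zipfile


def bundle_cerif_files(files):
    """Pack serialised CERIF-XML documents into one zip archive.

    files maps a file name to its XML text (names are illustrative only).
    Returns the archive as bytes, ready to be sent as a response body.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml_text in files.items():
            archive.writestr(name, xml_text)
    return buffer.getvalue()
```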

The procedure and techniques described above can be applied to any system that aims to export to CERIF-XML. However, the description of how data is loaded is of course Pure specific, and just serves as an example.

While writing the exporter a mapping document was created, and mapping decisions were recorded in the document.