Project Results © Brigitte Jörg June 4th, 2008 in Maribor, Slovenia Analyzing European Research Competencies in IST Results from a European SSA Project – Brigitte Jörg, Jure Ferlez, Hans Uszkoreit, Mitja Jermol (DFKI) (IJS) (DFKI) (IJS)

Project Results

© Brigitte Jörg June 4th, 2008 in Maribor, Slovenia

Analyzing European Research

Competencies in IST

– Results from a European SSA Project –

Brigitte Jörg, Jure Ferlez, Hans Uszkoreit, Mitja Jermol

(DFKI) (IJS) (DFKI) (IJS)

Project Information

Funding Organization: European Commission
Funding Program: Sixth Framework Programme (FP6: IST (3rd Call))
Project Type: Specific Support Action (SSA)
Duration: 32 Months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, CCLRC
Project Consortium: 15 partners from EU MS, NMS and ACC

Project Consortium

Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
Institute Jozef Stefan, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
Institute of Mathematics and Computer Science, University of Latvia, Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK (formerly CCLRC, UK)

Technology Partners

DFKI (Co-ordinator): “LT World” Portal, Information Extraction, Semantic Web

Jozef Stefan Institute (Technical Co-ordinator): “Project Intelligence”, Data Mining, Social Network Analysis

Ontotext: “KIM Semantic Annotation Platform”

euroCRIS: “CERIF” Standard, Access to Data

Project Objectives

Set up and populate an information portal on IST research

Provide information about RTD actors and their experience and expertise

Provide innovative and automated services

To promote RTD competencies in specific fields

To support partner search for IST proposals and commercial projects

Presentation Outline

Information Repository

Data Collection

Data Integration / Data Cleaning

Evaluation of Results

Analytic Tools

Overall Conclusion

Repository Features

Information Repository (CERIF 2004) containing: Organisation, Person, Project, Publications

Data Collection (CERIF XML) from: National CRISs, National Collections, Web Crawlings, Community Support

Data Integration into ONE single dataset to enable analysis at European Level

Data Cleaning with Supervised Machine Learning Methods (Active Learning)

Repository Data Analysis

Duplicate records inherent in single datasets

Even more duplicate records after merging single datasets

Most obvious duplicates for organisations and persons

No significant number of duplicate projects

Publications have been ignored

Duplicate records are a known problem

Formal Problem Definition (Winkler 2006)

Problem: duplicate detection in record set A

Given: a set of records in A

Classify: every pair (a, b) ∈ A × A into one of two classes:
  M (set of true matches)
  U (set of true non-matches)
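
The pairwise formulation can be sketched in a few lines of Python. This is an illustrative sketch, not the project's code: the sample names and the `candidate_pairs` helper are hypothetical. It also assumes that unordered pairs of distinct records are sufficient, whereas the slide writes the full cross product A × A.

```python
from itertools import combinations
from math import comb

def candidate_pairs(records):
    """Return every unordered pair (a, b) of distinct records in A."""
    return combinations(records, 2)

# Hypothetical sample of organisation names (not taken from the repository).
A = ["DFKI GmbH", "DFKI", "Jozef Stefan Institute", "Institut Jozef Stefan"]

pairs = list(candidate_pairs(A))
# Every pair has to end up in exactly one class: M (match) or U (non-match).
print(len(pairs))          # number of pairs to classify
print(comb(len(A), 2))     # same count, computed as n * (n - 1) / 2
```

The number of pairs grows quadratically with the number of records, which is why a blocking step is normally applied before any detailed comparison.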

IST World Problem Definition

Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets
  most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets
  not so many duplicates found in national datasets
  a lot of duplicate person records across all datasets
  no duplicate records found in project datasets
  only some duplicate records across project datasets
  publications have not been examined

Decision taken with respect to the IST World scope
  not touching project records
  ignore publication records
  find a solution for person records (IST World Community)
  concentrate on cleaning organisation records

Data Cleaning Application

Problems with Organisation Records

Most entries had slightly different names caused by additional special characters or character modifications
  Capitalization, Lowercase Letters
  Blanks, extra Spaces
  Hyphens
  Quotes
  Comma in Different Places
  Article in Name
  Full stop in Name
  Incomplete Names
  English Translation
  Word Order
  Language Specific Characters (Jorg instead of Jörg)
  Special Characters (wrong encoding &, ?, )

Mixture of Organisation Names and Department Names

Differences in Addresses
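
Several of the surface variations listed above (case, extra spaces, hyphens, quotes, commas, full stops, a leading article, language-specific characters) can be neutralised by a normalisation step before names are compared. The following is a minimal sketch under that assumption; `normalize_name` is a hypothetical helper, not part of the IST World code, and it does nothing for translated, incomplete or reordered names.

```python
import re
import unicodedata

# Hyphens, straight and typographic quotes, commas and full stops.
PUNCTUATION = re.compile("[-\"'“”„,.]")

def normalize_name(name: str) -> str:
    """Reduce an organisation name to a comparison key (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = plain.lower()
    # Treat punctuation as a separator rather than part of the name.
    spaced = PUNCTUATION.sub(" ", lowered)
    # Drop a leading English article, then collapse runs of whitespace.
    without_article = re.sub(r"^\s*the\s+", "", spaced)
    return " ".join(without_article.split())

print(normalize_name('The "Jozef  Stefan" Institute.'))
print(normalize_name("JOZEF-STEFAN INSTITUTE"))
```

Both calls produce the same key, so the two spellings would be treated as candidates for the same organisation.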

IST World Dataset Integration

Input: records in A, Knowledge about Records

1. blocking
   Result: Blocked Records in A x A
   Organisation Names: Fulltext Indexing, Querying

2. feature generation
   Result: Blocked Records in A x A with comparison Features
   Organisation Names + Location:
   (1) Name/Location Strings (Bag of Words)
   (2) Word/Character Order (String Kernels)
   (3) Spelling Errors (Edit Distance Measure)
   (4) Normalization of (1-3)

3. active learning
   Result: Blocked Records in A x A with comparison Features; some records have M (red) and U (green) class labels
   Human Decision: M = Match, U = Non-Match, - = unknown

4. model induction
   Result: Automatic Classification Model
   Machine Learning (Support Vector Machine): M = Match, U = Non-Match, - = unknown

5. model application
   Result: Blocked Records in A x A with comparison Features; all records have M (red) and U (green) class labels
   Machine Decision: M = Match, U = Non-Match
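
The first two steps of the pipeline (blocking and feature generation) can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the project's implementation: blocking is approximated by a shared-word index instead of a fulltext engine, string kernels are left out, and all function names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def block_by_token(names):
    """Return index pairs of records that share at least one word."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for token in name.lower().split():
            index[token].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def jaccard(a, b):
    """Bag-of-words overlap between two names, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def features(a, b):
    """Comparison features for one pair, each normalised to [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return [jaccard(a, b), 1.0 - edit_distance(a.lower(), b.lower()) / longest]

# Hypothetical records; only pairs that survive blocking get features.
names = ["Jozef Stefan Institute", "Institute Jozef Stefan", "DFKI GmbH", "TUBITAK"]
for i, j in sorted(block_by_token(names)):
    print(names[i], "|", names[j], features(names[i], names[j]))
```

The resulting feature vectors are what a classifier such as a Support Vector Machine would be trained on in steps 3 to 5.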

Active Learning Application
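
The slides do not state which query strategy the active learning application used. A common choice with margin-based classifiers is uncertainty sampling: the annotator is shown the unlabelled pairs whose decision value lies closest to the class boundary. The sketch below assumes that strategy; the pair ids, decision values and the `most_uncertain` helper are all hypothetical.

```python
def most_uncertain(scores, labels, k=2):
    """Return the k unlabelled pair ids whose score is closest to the boundary at 0."""
    unlabelled = [pid for pid in scores if labels.get(pid, "-") == "-"]
    return sorted(unlabelled, key=lambda pid: abs(scores[pid]))[:k]

# Hypothetical decision values: positive leans towards M, negative towards U.
scores = {"p1": 2.1, "p2": -0.1, "p3": 0.4, "p4": -1.7, "p5": 0.05}
# Pairs already judged by a human; every other pair is "-" (unknown).
labels = {"p1": "M", "p4": "U"}

queue = most_uncertain(scores, labels)
print(queue)   # pairs to present to the human annotator next
```

After each round of human decisions the model would be retrained and the remaining pairs rescored.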

Evaluation of Results in CORDIS FP6 dataset

human evaluation of 1000 organisation record pairs
  30 M correct; 934 U correct
  1 M incorrect; 35 U incorrect
  97% precision
  46% recall

integration approach worked well

can be used for large scale integration tasks

Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
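
The reported precision and recall follow directly from the four counts on the slide. The short check below assumes that "M incorrect" denotes pairs wrongly classified as matches (false positives) and "U incorrect" denotes matches wrongly classified as non-matches (false negatives), a reading that is consistent with the reported percentages.

```python
# Counts reported for the 1000 evaluated organisation record pairs.
m_correct, m_incorrect = 30, 1      # classified M: true / false positives
u_correct, u_incorrect = 934, 35    # classified U: true / false negatives

precision = m_correct / (m_correct + m_incorrect)
recall = m_correct / (m_correct + u_incorrect)

print(f"precision = {precision:.0%}")   # → 97%
print(f"recall    = {recall:.0%}")      # → 46%
```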

Analytic Tools

Advanced Tools: Collaboration Diagram, Competence Diagram

Experimental Tools: Collaboration Trends, Competence Trends, Consortia Prediction, Semantic Search

How to analyze or generate a Diagram

(1) definition of a query in the IST World Portal

(2) get a list of result records matching the query

(3) generate diagrams based on results

Competence Diagram

Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6

Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE

Projects (Red Dots): Linked with Full Record in Repository

Competence Diagram

Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6

Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …

Configuration of Result Space: 40% of result list, 30 topics

Competence Diagram

Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6

Configuration of Result Space: 40% of result list, 30 topics

Diagram labels: Goals, Themes

Collaboration Diagram

Query: IST SSA projects within FP6
Aim: investigate the collaboration of SSA partners in FP6

Configuration of Result Space: 20% of result list

Diagram labels: Project, Number of joint partners

Evaluation of Analytic Tools

IST World made it possible to perform the defined tasks

For more details, see the full paper in the Proceedings

All analytics depend on the underlying data

The analytic tools are very powerful

Evaluation of Queries

Query execution performed in March 2008
Queried datasets: IST World / Cordis

IST World Portal: http://www.ist-world.org/
CORDIS Search: http://cordis.europa.eu/en/home.html

                                          IST World                     CORDIS
                                          (crawled data from CORDIS)
Investigation Date                        March 2008                    March 2008
Last Updated                              January 2007                  constantly updated
Query 1: Specific Support Action IST FP6  64                            208
Query 2: Specific Support Action IST      377                           1178
Query 3: Specific Support Action FP6      185                           1507
Query 4: Specific Support Action          1554                          2012
Query 5: Project Keywords: Specific       200                           not checked
         Support Action, Programme: IST,
         StartDate: After 01/01/2002
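
To compare the two sources at a glance, the hit counts for Queries 1 to 4 can be expressed as a ratio. This is only an illustrative calculation on the figures in the table: a ratio of hit counts says nothing about whether the same records were returned, and Query 5 is left out because it was not checked in CORDIS.

```python
# (IST World hits, CORDIS hits) per query, copied from the table above.
counts = {
    "Query 1": (64, 208),
    "Query 2": (377, 1178),
    "Query 3": (185, 1507),
    "Query 4": (1554, 2012),
}

ratios = {query: ist_world / cordis for query, (ist_world, cordis) in counts.items()}
for query, ratio in ratios.items():
    print(f"{query}: IST World returned {ratio:.0%} of the CORDIS hit count")
```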

Results of Query Evaluation

Discovered inconsistencies with Cordis data:

“FP6” string: 30 of 80 relevant records missed the string

“SSA” string: 15 of 208 relevant records missed the string

“Specific Support Action” string: 15 of 208 relevant records missed the string

Dates (Year of the call): not consistently recorded

Query 1: 22 projects contained the string “Coordination Action”, “Specific Targeted Action”, “Integrated Project”, others

An investigation of the results of Query 1 in Cordis revealed: 80 projects of the result list are missing in IST World

Overall Conclusion

Integration Method: could be further developed
  Test data could be used to generate a better classification model
  Feature generation could be improved by using ontological knowledge
  Transfer learning methods might be helpful for re-use of the learned model

Evaluation of large Datasets:
  very difficult
  needs expert knowledge

Analytic Tools:
  depend on the quality of the data behind them
  are very powerful for investigation of large datasets

European Research Dataset (entries), January 2008

European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs

Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
Cyprus: 29 Orgs
Czech Republic: 183 Orgs, 163 Proj, 164 Exp
Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
Latvia: 106 Orgs, 830 Proj, 701 Exp
Lithuania: 102 Orgs
Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
Romania: 169 Orgs, 68 Proj, 87 Exp
Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
Slovakia: 56 Orgs, 432 Proj, 683 Exp
Turkey: 285 Orgs
EPRI-start: 286 Orgs, 275 Exp
Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
Community: 61 Orgs, 41 Proj, 435 Exp

Beyond the Project

IST World is online: http://www.ist-world.org/

Registration is free

Create your Competence Map / Collaboration Map

Continuation is planned …