LIDER: FP7 – 610782
Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe

Deliverable number: D4.8
Deliverable title: Fourth Roadmapping Workshop Report
Main Authors: Felix Sasaki
Grant Agreement number: 610782
Project ref. no: FP7-610782
Project acronym: LIDER
Project full name: Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe
Starting date (dur.): 1/11/2013 (24 months)
Ending date: 31/10/2015
Project website: http://www.lider-project.eu/
Coordinator: Asunción Gómez-Pérez
Address: Campus de Montegancedo sn. 28660 Boadilla del Monte, Madrid, Spain
Reply to: [email protected]
Phone: +34-91-336-7417
Fax: +34-91-3524819

FP7-610782

D4.8 Fourth Roadmapping Workshop Report Page 2 of 15

Document Identifier: D4.8
Class: Deliverable LIDER EU-ICT-2013-610782
Version: 1.1
Document due date: 31 January 2015
Submitted: 13 February 2015
Responsible: W3C/ERCIM
Reply to: [email protected]
Document status: Final
Nature: O (Other)
Dissemination level: PU (Public)
WP/Task responsible(s): Felix Sasaki, DFKI / W3C Fellow
Contributors: -
Distribution List: Consortium Partners
Reviewers: Reviewed by the project consortium
Document Location: http://lider-project.eu/?q=doc/deliverables

Executive Summary

This document, the fourth roadmapping workshop report, summarizes the outcome of three roadmapping activities:

• On 3 December 2014, LIDER held two sessions at the SHARE-PSI 2.0 workshop in Lisbon. These sessions provided valuable feedback from the European public sector information community.

• On 7 January 2015, DFKI held a meeting between researchers from two communities: Linguistic Linked Data (LLD) and Machine Translation (MT). The aim was to discuss research topics of common interest, strategic input to European R&D planning, and potential next steps.

• On 16 January 2015, LIDER participated in the Big Data Networking Day, organised by the European Commission.

This report takes a different approach from the previous roadmapping reports. It does not focus on summarizing dedicated LIDER roadmapping workshops but covers different types of roadmapping activities. LIDER decided to report on these activities because they give insights into areas such as public sector information and machine translation research, which previous LIDER activities had not addressed. All roadmapping activities held by LIDER are listed at

https://www.w3.org/community/ld4lt/wiki/Lider_roadmapping_activities

This page will be kept up to date for the duration of the project.

Document Information

IST Project Number: FP7-610782
Acronym: LIDER
Full Title: Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe
Project URL: http://www.lider-project.eu/
Document URL: http://lider-project.eu/?q=doc/deliverables
EU Project Officer: Susan Fraser
Deliverable: Number D4.8, Title Fourth Roadmapping Workshop Report
Workpackage: Number 4, Title Community building and dissemination
Date of Delivery: Contractual 31 January 2015, Actual 13 February 2015
Status: version 1.2, final [x]
Nature: prototype [ ], report [ ], dissemination [x]
Dissemination level: public [x], consortium [ ]
Authors (Partner): Felix Sasaki, DFKI / W3C Fellow
Responsible Author: Name Felix Sasaki, E-mail [email protected], Partner DFKI / W3C Fellow, Phone +49-30-23895-1807

Abstract (for dissemination)

This document, the fourth roadmapping workshop report, summarizes the outcome of three roadmapping activities: the LIDER sessions held on 3 December 2014 at the SHARE-PSI 2.0 workshop in Lisbon; a meeting organised by DFKI on 7 January 2015 between researchers from two communities, Linguistic Linked Data (LLD) and Machine Translation (MT); and the LIDER participation in the 16 January 2015 Big Data Networking Day, organised by the European Commission.

Keywords: LIDER, roadmapping workshop, report

Version   Modification(s)                          Date       Author(s)
01        First Draft                              04/02/15   Felix Sasaki, DFKI / W3C Fellow
02        Review by Bettina Klimek (InfAI)         05/02/15   Felix Sasaki, DFKI / W3C Fellow
03        Review by Asunción Gómez-Pérez (UPM)     08/02/15   Felix Sasaki, DFKI / W3C Fellow

Project Consortium Information

Participants and contacts:

• Universidad Politécnica de Madrid: Asunción Gómez-Pérez, Email: [email protected]
• The Provost, Fellows, Foundation Scholars & The Other Members of Board of The College of the Holy & Undivided Trinity of Queen Elizabeth near Dublin (Trinity College Dublin, Ireland): David Lewis, Email: [email protected]
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI, Germany): Felix Sasaki, Email: [email protected]
• National University of Ireland, Galway (NUI Galway, Ireland): Paul Buitelaar, Email: [email protected]
• Institut für Angewandte Informatik EV (INFAI, Germany): Sebastian Hellmann, Email: [email protected]
• Universität Bielefeld (UNIBI, Germany): Philipp Cimiano, Email: [email protected]
• Universita degli Studi di Roma La Sapienza (UNIVERSITA DEGLI STU, Italy): Roberto Navigli, Email: [email protected]
• GEIE ERCIM (ERCIM, France): Felix Sasaki, Email: [email protected]

Table of Contents

1 Introduction
2 LIDER Session at SHARE-PSI Event
2.1 Introduction
2.2 Contributions
2.3 Key Points of the Sessions
3 Linked Data and Machine Translation
3.1 Introduction
3.2 Contributions
3.3 Key Points of the Meeting
4 LIDER Participation in Big Data Networking Day
5 Conclusion and Next Steps

1 Introduction

This document, the fourth roadmapping workshop report, summarizes the outcome of three roadmapping activities:

• On 3 December 2014, LIDER held two sessions at the SHARE-PSI 2.0 workshop in Lisbon. These sessions provided valuable feedback from the European public sector information community.

• On 7 January 2015, DFKI held a meeting between researchers from two communities: Linguistic Linked Data (LLD) and Machine Translation (MT). The aim was to discuss research topics of common interest, strategic input to European R&D planning, and potential next steps.

• On 16 January 2015, LIDER participated in the Big Data Networking Day, organised by the European Commission.

This report takes a different approach from the previous roadmapping reports. It does not focus on summarizing dedicated LIDER roadmapping workshops but covers different types of roadmapping activities. LIDER decided to report on these activities because they give insights into areas such as public sector information and machine translation research, which previous LIDER activities had not addressed. All roadmapping activities held by LIDER are listed at

https://www.w3.org/community/ld4lt/wiki/Lider_roadmapping_activities

This page will be kept up to date for the duration of the project.

2 LIDER Session at SHARE-PSI Event

2.1 Introduction

SHARE-PSI 2.0 is a network for innovation in European public sector information (PSI). Its main instrument is a series of workshops that gather feedback from the PSI community and beyond. In December 2014, LIDER participants held two sessions at the workshop on encouraging open data usage by commercial developers (3-4 December, Lisbon), at the kind invitation of the Share-PSI thematic network. The event consisted of free, “un-conference style” discussion sessions held in parallel. For this reason, the summary below does not represent the flow of discussion or separate presentations. Each LIDER session had about 20 participants with quite heterogeneous backgrounds, e.g.:

• Public sector data providers from various domains, e.g. governmental providers from various countries covering different languages, or library representatives.
• Providers of public information systems, deploying public sector information.
• Software engineers engaging in the development of public open data portals.
• Researchers from various data and language related areas.

2.2 Contributions

Session on “Multilingual PSI data on the Web”

The goal of the session was to demonstrate the relevance of multilingual (linked) data to the PSI community, and to understand which challenges need to be resolved to foster widespread adoption of multilingual PSI. The session participants included PSI data providers from public administrations, developers of open data portals, and developers of end user applications. Some PSI providers are legally mandated to provide data in several languages, but manual translation is too expensive. Other providers have many metadata records that need cleansing, e.g. conversion to standardized formats, and entity detection within and across languages. The library community is both a provider of relevant entity definitions and in need of cleansing tooling. The following table summarizes the technology needs discussed during the session (“what”), the reasons for these needs (“why”) and the challenges for their adoption (“how”). The general conclusion was that the technology plays a role especially in B2B scenarios, that is, in connecting PSI data providers and application developers. The end user then profits indirectly, e.g. by gaining multilingual search facilities.

What: Technology needs
• Entity detection in context
• Cross-lingual linking
• Translation
• Language detection
• Create curated data
• Converting standard non-linked vocabularies to linked data, relying on standard linked data vocabularies

Why: (Commercial B2B or end customer) applications
• Aggregation
• (meta)data cleansing, enriching
• Data curation
• Guess meaning roughly
• High quality translation

How: challenges
• Availability of mature tooling
• Tooling in the right workflow, tool integration (hiding SPARQL from users)
• Processing of mass data
• Small languages
• Handling legacy data
• Precision of tooling
• Dealing with uncertainty of processing output (e.g. “Is this an entity of class X?”)

Session on “Making your Data Multilingual and Interoperable”

The second session organized by LIDER focused on a specific technical aspect of multilingual linked data: relying on standardized data formats and fostering both syntactic and semantic interoperability. As in the first session, the outcome was summarized along three dimensions.

First, what should be done to publish PSI? One should publish data that is of high quality and resolve semantic conflicts before publishing. One should use a standardized, linked data format and rely on ontologies as well as established terminologies. Re-use is important. One should combine general data and metadata with domain and application specific data and metadata, and take their context into account. The representation of multilingual information needs to be designed carefully.

Second, why does multilingual standardized linked data facilitate the publication or reuse of PSI? The answer is that rich metadata facilitates the discovery and use of PSI across languages.

Third, how can one achieve multilingual linked data, and how can one measure or test it? Currently there is no technical means to measure or test the additional value of multilingual linked data. As a basis, the LIDER project provides best practices and guidelines about data publication.
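The point about carefully designed multilingual representation can be illustrated with a minimal Python sketch. It is not taken from the session: the concept URI, the labels and the helper names `label_for` and `to_turtle` are assumptions made for illustration. It shows labels keyed by language tag, a lookup with a fallback language, and a serialization as language-tagged `rdfs:label` statements in Turtle-like syntax (prefix declaration omitted).

```python
# Hypothetical PSI concept with labels keyed by language tag (made-up data).
CONCEPT = {
    "uri": "http://example.org/psi/concept/library",
    "labels": {"en": "library", "pt": "biblioteca", "de": "Bibliothek"},
}


def label_for(concept, preferred, fallback="en"):
    """Pick the label in the preferred language, else the fallback language."""
    labels = concept["labels"]
    if preferred in labels:
        return labels[preferred], preferred
    return labels[fallback], fallback


def to_turtle(concept):
    """Serialize the labels as language-tagged rdfs:label statements.

    Illustrative only: no escaping of special characters is performed.
    """
    lines = [
        f'<{concept["uri"]}> rdfs:label "{text}"@{lang} .'
        for lang, text in sorted(concept["labels"].items())
    ]
    return "\n".join(lines)


print(label_for(CONCEPT, "pt"))
print(label_for(CONCEPT, "fr"))
print(to_turtle(CONCEPT))
```

Returning the language actually used alongside the label lets an application signal to the end user when it had to fall back to another language.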

2.3 Key Points of the Sessions

The first session clearly showed that there are technical needs in the PSI community for linguistic linked data and language technology. Conversion to a standardized format and data cleansing involving entity linking within and across languages were mentioned frequently. Once PSI data has been improved, the data itself can feed into applications, e.g. cross-lingual search, and can also improve language technologies like named entity recognition or machine translation.

The second session showed the value of making data multilingual and interoperable by using linked data technologies and standardized ontologies and terminologies. The session helped to raise awareness of why standards in this area are beneficial. Some participants asked for a way to measure the gain from using standardized formats and vocabularies. Currently, there is no technical means to achieve this, but best practices are being developed as a basis.

3 Linked Data and Machine Translation

3.1 Introduction

On 7 January 2015, DFKI organised a half-day meeting at its Berlin location with participants from key research groups in the realm of machine translation and linguistic linked data. In total there were 19 participants (including six remote participants) from the following organisations:

• Bielefeld University
• DFKI
• InfAI
• Insight (NUIG)
• ADAPT & Trinity College Dublin
• Universita degli Studi di Roma La Sapienza
• Universidad Politécnica de Madrid
• ADAPT & Dublin City University

Although seven of the above institutions belong to the LIDER project, seven of the meeting participants do not contribute regularly to LIDER activities. Among these are the coordinators of current and upcoming projects in the area of machine translation. The projects encompass the FP7 project QTLaunchPad and the Horizon 2020 projects QT21 and Cracker. The meeting sought to explore the potential for convergence in topics addressed by the machine translation and linguistic linked data research communities, and specifically to identify joint research themes, use cases and implementation showcases that demonstrate research and business synergies. Here, we provide a summary of the meeting.

3.2 Contributions

The discussion can be summarized in three areas:

• Research topics that include aspects of machine translation and linguistic linked data;

• Strategic discussions on how to influence research planning in Europe related to the two areas; and

• Showcases that demonstrate the usefulness of combining linguistic linked data and machine translation.

The last part of the event was dedicated to a discussion of potential next steps.

Research Topics

Several research topics were discussed, including the following:

1) The need for shared data repositories with LLD resources specific to MT (training) tasks. An example might be representing inflected word forms across languages. From the point of view of language technology, this is often seen as a solved problem. However, representing morphological information as linguistic linked data at large scale has not been done so far, although models like LexInfo exist.

2) Interconnecting structured and unstructured language resources. Comparable corpora have given disappointing results in MT training. Using machine learning methods like distant supervision and matching structured, unstructured and partially structured information could lead to better results. Involving large-scale lexical semantic resources like BabelNet to feed lexical correspondences into MT training could help in the absence of parallel corpora.

3) Data management and data integration aspects of MT and LLD. One factor that hinders the usage of LLD in MT training is its graph structure, which leads to performance challenges. Recently, a standardization effort that tackles this challenge in general (beyond the linguistic domain) has started. Using corpora in tabular formats with LLD metadata can lower the barrier to adoption by the NLP communities in MT training and evaluation activities, e.g. WMT tasks.

In general, a bootstrapping approach that uses LLD to improve MT, and vice versa, seems most promising. This would continuously involve researchers from both communities. The participants recognized that one gap between the communities is knowledge, e.g. about tooling, available corpora and processing workflows. This is not a research topic but needs to be addressed as a basis for joint research.
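The idea of a tabular corpus accompanied by linked-data metadata, mentioned under the third research topic, can be illustrated with a minimal Python sketch. It was not presented at the meeting: the corpus rows, the metadata keys and the `http://example.org/...` identifiers are assumptions made for illustration and do not follow any particular standard. It shows how an MT training script could keep reading plain tab-separated rows while a separate metadata record states the language of each column.

```python
import csv
import io
import json

# Hypothetical two-column parallel corpus (tab-separated); contents are made up.
CORPUS_TSV = "source\ttarget\nGood morning\tGuten Morgen\nThank you\tDanke\n"

# Hypothetical metadata describing the table; keys and URIs are illustrative only.
METADATA_JSON = """
{
  "corpus": "http://example.org/corpus/demo",
  "license": "http://example.org/license/open",
  "columns": [
    {"name": "source", "lang": "en"},
    {"name": "target", "lang": "de"}
  ]
}
"""


def load_parallel_corpus(tsv_text, metadata_text):
    """Return (metadata, sentence pairs) with each sentence keyed by language."""
    metadata = json.loads(metadata_text)
    # Map each column name to the language declared in the metadata.
    lang_of = {col["name"]: col["lang"] for col in metadata["columns"]}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = [
        {lang_of[name]: text for name, text in row.items()}
        for row in reader
    ]
    return metadata, pairs


metadata, pairs = load_parallel_corpus(CORPUS_TSV, METADATA_JSON)
print(metadata["corpus"], len(pairs), pairs[0])
```

Keeping the metadata outside the table means existing tools that expect plain tabular input need no changes, while the corpus identifier and language declarations remain available for discovery and linking.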

Strategic Discussions

The relation of strategic inputs to future EU planning was discussed.

• LIDER http://www.lider-project.eu (Asunción Gómez-Pérez) presented its two-page document that summarizes the key dimensions of how LLD can be used in language technologies: linking, content analytics, trust, privacy, universal access to data commons and public services across languages, and access to information and services without borders.

• FALCON http://www.falcon-project.eu (Dave Lewis) presented the research and economic potential of using open data in next-generation machine translation, using massive, large-scale bilingual dictionaries like BabelNet as aggregators, and contributing to multilingual business scenarios like eCommerce and open data management for public automated translation.

• BabelNet http://babelnet.org/. BabelNet demonstrates how conversion to a common, lexical-conceptual LLD vocabulary allows very large scale resource aggregation with outbound links. These links allow decentralised management of the source resource.

In the current data-focused R&D landscape, the most promising role of MT seems to be to contribute to a data value chain scenario. The FREME project http://freme-project.eu/ shows how such an argument can be made successfully: it takes certain data-focused usage and business scenarios as a starting point and examines them in depth to specify the R&D needs. In the case of FREME the domains are: digital publishing, translation and localisation, agriculture and food domain data management, and personalisation.

Showcases

Several showcases demonstrated the application of linguistic linked data.

• Babelfy (Roberto Navigli) http://babelfy.org. Babelfy is based on the LLD resource BabelNet and used for entity linking and word sense disambiguation. Semantic information identified via Babelfy could help to improve MT.

• LingHub (John McCrae). http://linghub.lider-project.eu/. LingHub provides metadata related to LLD based and general language resources. One usage scenario related to MT is discovery of resources needed for a specific MT implementation.

• Not at the meeting but discussed later: previous work on mining translations from the Web of data [1].

So far there is no killer showcase that demonstrates the usefulness of MT for LLD or data applications in general. This is an area for next steps (see below).

3.3 Key Points of the Meeting

The meeting did not result in concrete actions and timelines but in several potential areas for collaboration. This is related to current (LIDER) and upcoming (e.g. CRACKER, QT21) projects.

• Promote usage info on LLD resources in MT, to demonstrate their usefulness to the MT community.

• Enhance tabular corpora with LLD metadata to enable more open, repeatable data management for shared MT training and evaluation tasks.

• Develop “killer application showcases” demonstrating the use of MT for making the linked data Web multilingual, and vice versa. This is a two way relationship between MT and LLD. Currently it seems that the benefits of LLD for MT are easier to showcase, e.g. by using a multilingual LLD resource for MT training.

• Consider joint research proposals. A project consortium built around linked data experts who understand multilingual aspects is key.

• It was agreed that any strategic research direction for joint MT and LLD research should be informed by current public consensus building efforts with the aim of maximising industry involvement. Ongoing public consensus building activities include:

o Metadata for language resources [2] developed with META-SHARE input at the Linked Data for Language Technology (LD4LT) [3] community group at the W3C.

o Discussion on open data management for public automated translation [4] at the ITS interest group at the W3C, which may provide useful input to the CEF for MT services.

o An initial LLD research roadmap [5] developed by the LIDER project, due to be reviewed and revised at the LD4LT group.

o Best practice on multilingual linked open data [6], which will address linked data profiles for terminology [7] and parallel text [8] resources.

[1] See http://www.aclweb.org/anthology/W/W13/W13-52.pdf#page=18
[2] See http://www.w3.org/community/ld4lt/wiki/Meta-Share_OWL_metamodel
[3] See http://www.w3.org/community/ld4lt/
[4] See http://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Automated_Translation_Services
[5] See http://www.lider-project.eu/sites/default/files/D3.2.1-v1.0.pdf
[6] See http://www.lider-project.eu/sites/default/files/D2.1.1.Phase-I-v2.0.pdf
[7] See http://www.w3.org/community/bpmlod/wiki/Converting_TBX_to_RDF
[8] See http://www.w3.org/community/bpmlod/wiki/Draft_Guidelines_on_Bitext_as_Linked_Data

4 LIDER Participation in Big Data Networking Day

Several participants of LIDER contributed to the 16 January 2015 Big Data Networking Day, which was organised by the European Commission. The event had a workshop dedicated to “Multilingual Data Value Chains in the Digital Single Market” [9]. The LIDER participants gave presentations on:

• LIDER in general and the value of linguistic linked data;
• The current version of the LIDER roadmap; and
• LIDER-related projects like FALCON and FREME.

In addition, LIDER participants contributed to discussions about research and innovation planning in the area of language technologies. Here, LIDER cooperated with the Horizon 2020 projects CRACKER and LT-OBSERVATORY.

[9] See https://ec.europa.eu/digital-agenda/en/news/workshop-multilingual-data-value-chains-digital-single-market for details about the event.

5 Conclusion and Next Steps

The three roadmapping activities reported in this document showed the value of going deep into domains (like public sector information), of building bridges to other research communities (like machine translation) and of contributing to networking activities organised by the European Commission. The next steps differ for each of these areas. LIDER will continue detailed engagement with the public sector information community in an upcoming SHARE-PSI event. LIDER participants are discussing concrete research proposals with the machine translation community. Finally, LIDER will continue to contribute to the strategic activities of the language technology community, and help to shape upcoming research and innovation planning, including both data and language related aspects.