the summa platform prototype - university of...

The SUMMA Platform Prototype

Renars Liepins (†) [email protected]

Ulrich [email protected]

Guntis Barzdins (†) · Alexandra Birch (H) · Steve Renals (H) · Susanne Weber (u)

Peggy van der Kreeft (i) · Hervé Bourlard (U) · João Prieto (J) · Ondřej Klejch (H)

Peter Bell (H) · Alexandros Lazaridis (U) · Alfonso Mendes (J) · Sebastian Riedel (s)

Mariana S. C. Almeida (J) · Pedro Balage (J) · Shay Cohen (H) · Tomasz Dwojak (H)

Phil Garner (U) · Andreas Giefer (i) · Marcin Junczys-Dowmunt (H) · Hina Imran (i)

David Nogueira (J) · Ahmed Ali (?) · Sebastião Miranda (J) · Andrei Popescu-Belis (U)

Lesly Miculicich Werlen (U) · Nikos Papasarantopoulos (H) · Abiola Obamuyide (‡)

Clive Jones (u) · Fahim Dalvi (?) · Andreas Vlachos (‡) · Yang Wang (U) · Sibo Tong (U)

Rico Sennrich (H) · Nikolaos Pappas (U) · Shashi Narayan (H) · Marco Damonte (H)

Nadir Durrani (?) · Sameer Khurana (?) · Ahmed Abdelali (?) · Hassan Sajjad (?)

Stephan Vogel (?) · David Sheppey (u) · Chris Hernon (u) · Jeff Mitchell (s)

(†) Latvian News Agency · (H) University of Edinburgh · (i) Deutsche Welle · (u) BBC · (U) Idiap Research Institute · (J) Priberam Informatica S.A. · (s) University College London · (‡) University of Sheffield · (?) Qatar Computing Research Institute

Abstract

We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.

1 Introduction

SUMMA (Scalable Understanding of Multilingual Media)[1] is a three-year Research and Innovation Action (February 2016 through January 2019), supported by the European Union’s Horizon 2020 research and innovation programme. SUMMA is developing a highly scalable, integrated web-based platform to automatically monitor an arbitrarily large number of public broadcast and web-based news sources.

[1] www.summa-project.eu

Two concrete use cases and an envisioned third use case drive the project.

Figure 1: Architecture of the SUMMA Platform. Labelled components: data feed modules; live stream monitoring. Annotations in the figure: Light-weight node.js wrappers integrate components into the SUMMA infrastructure. Journalists access the system through a web-based GUI. RethinkDB stores media content and meta-information. RabbitMQ handles communication between the database and the NLP modules that augment the monitored data. NLP modules for ASR, MT, Named Entity Tagging, etc. run as applications in Docker containers. Docker Compose orchestrates all the components into a working infrastructure. Docker provides containerization of the various components.

1.1 Monitoring of External News Coverage

BBC Monitoring, a division of the British Broadcasting Corporation (BBC), monitors a broad variety of news sources from all over the world on behalf of the BBC and external customers. About 300[2] staff journalists and analysts track TV, radio, internet, and social media sources in order to detect trends and changing media behaviour, and to flag breaking news events. A single monitoring journalist typically monitors four TV channels and several online sources simultaneously. This is about the maximum that any person can cope with mentally and physically. Assuming 8-hour shifts, this limits the capacity of BBC Monitoring to monitoring about 400 TV channels at any given time on average. At the same time, BBC Monitoring has access to about 13,600 distinct sources, including some 1,500 TV and 1,350 radio broadcasters. Automating the monitoring process not only allows the BBC to cover a broader spectrum of news sources, but also allows journalists to perform deeper analysis by enhancing their ability to search through broadcast media across languages in a way that other monitoring platforms do not support.

[2] To be reduced to 200 by the end of March 2017.
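The capacity estimate in Section 1.1 can be reproduced with a few lines of arithmetic. The sketch below assumes, beyond what the text states, that all staff monitor channels and are spread evenly over round-the-clock 8-hour shifts; the function name and its parameters are illustrative only.

```python
# Back-of-the-envelope check of the monitoring capacity quoted in Section 1.1.
# Assumption (not stated in the text): staff are spread evenly over
# round-the-clock shifts, and every staff member monitors channels.

HOURS_PER_DAY = 24


def concurrent_tv_channels(staff, shift_hours=8, channels_per_person=4):
    """Average number of TV channels that can be watched at any one time."""
    shifts_per_day = HOURS_PER_DAY / shift_hours   # 3 shifts for 8-hour shifts
    on_duty = staff / shifts_per_day               # people working at any moment
    return on_duty * channels_per_person


current = concurrent_tv_channels(300)   # staffing level quoted in the text
reduced = concurrent_tv_channels(200)   # staffing level after the planned reduction
tv_sources = 1500                       # TV broadcasters BBC Monitoring can access

print(f"300 staff: {current:.0f} channels ({current / tv_sources:.0%} of TV sources)")
print(f"200 staff: {reduced:.0f} channels ({reduced / tv_sources:.0%} of TV sources)")
```

Under these assumptions, manual monitoring covers only a fraction of the accessible TV broadcasters at any moment, which is the gap that automation is meant to close.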

1.2 Monitoring Internal News Production

Deutsche Welle is Germany’s international public service broadcaster. It provides international news and background information from a German perspective in 30 languages worldwide, 8 of which are used within SUMMA. News production within Deutsche Welle is organized by language and regional departments that operate and create content fairly independently. Interdepartmental collaboration and awareness, however, are important to ensure a broad, international perspective. Multilingual internal monitoring of world-wide news production (including underlying background research) helps to increase awareness of the work between the different language news rooms, decrease latency in reporting, and reduce the cost of news production within the service by allowing adaptation of existing news stories for particular target audiences rather than creating them from scratch.

1.3 Data Journalism

The third use case is data journalism. Measurable data is extracted from the content available in and produced by the SUMMA platform and graphics are created with such data. The data journalism dashboard will be able to provide, for instance, a graphical overview of trending topics over the past 24 hours or a heatmap of storylines. It can place geolocations of trending stories on a map. Customised dashboards can be used to follow particular storylines. For the internal monitoring use case, it will visualize statistics of content that was reused by other language departments.

2 System Architecture

Figure 1 shows an overview of the SUMMA Platform prototype. The Platform is implemented as an orchestra of independent components that run as individual containers on the Docker platform. This modular architecture gives the project partners a high level of independence in their development.

The system comprises the following individual processing components.

2.1 Data Feed Modules and Live Streams

These modules each monitor a specific news source for new content. Once new content is available, it is downloaded and fed into the database via a common REST API. Live streams are automatically segmented into logical segments.

Figure 2: Web-based User Interface of the SUMMA Platform (Storyline View). Annotations in the figure: Storyline Index: quick access to current storylines. Storyline Summary: multi-item/document summary of news items in the storyline. Individual news stories within the storyline: on the left, a frame to play the original video (if applicable); on the right, a tabbed text box with automatic transcription of the original audio source, automatic translation (plain text), and automatic translation with recognized named entities marked up.

2.2 Database Back-end

RethinkDB[3] serves as the database back-end. Once new content is added, RethinkDB issues processing requests to the individual NLP processing modules via RabbitMQ.
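The flow described here (new content arrives, one processing request per NLP module is queued, results are written back next to the content) can be sketched in a few lines. This is only an in-memory illustration: a dictionary stands in for RethinkDB, `queue.Queue` objects stand in for RabbitMQ queues, and the class, function, and module names are hypothetical rather than taken from the platform. It also ignores ordering between modules, such as ASR having to run before MT.

```python
import queue

# Hypothetical module names; the platform's real queue names are not given in the text.
NLP_MODULES = ["asr", "mt", "entity_tagging"]


class ContentStore:
    """In-memory stand-in for the database back-end."""

    def __init__(self, module_names):
        self.documents = {}
        # One FIFO queue per NLP module, standing in for message-broker queues.
        self.queues = {name: queue.Queue() for name in module_names}

    def add(self, doc_id, content):
        """Store new content and issue one processing request per module."""
        self.documents[doc_id] = {"content": content, "annotations": {}}
        for q in self.queues.values():
            q.put(doc_id)

    def annotate(self, doc_id, module, result):
        """Write a module's output back next to the original content."""
        self.documents[doc_id]["annotations"][module] = result


def run_worker(store, module, process):
    """Drain one module's queue, processing each requested document."""
    q = store.queues[module]
    while not q.empty():
        doc_id = q.get()
        result = process(store.documents[doc_id]["content"])
        store.annotate(doc_id, module, result)


store = ContentStore(NLP_MODULES)
store.add("doc-1", "Breaking news from Riga")
# Toy "entity tagger": capitalised tokens stand in for a real NLP module.
run_worker(store, "entity_tagging",
           lambda text: [tok for tok in text.split() if tok[0].isupper()])
print(store.documents["doc-1"]["annotations"])
```

Because each module only reads from its own queue and writes its own annotation, modules can be developed, deployed, and scaled independently, which matches the modular design goal stated in Section 2.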

2.3 Automatic Speech Recognition

Spoken language from audio and video streams is first processed by automatic speech recognition to turn it into text for further processing. Models are trained on speech from the broadcast domain using the Kaldi toolkit (Povey et al., 2011); speech recognition is performed using the CloudASR platform (Klejch et al., 2015).

2.4 Machine Translation

The lingua franca within SUMMA is English. Machine translation based on neural networks is used to translate content into English automatically. The back-end MT systems are trained with the Nematus Toolkit (Sennrich et al., 2017); translation is performed with AmuNMT (Junczys-Dowmunt et al., 2016).

2.5 Entity Tagging and Linking

Depending on the source language, Entity Tagging and Linking is performed either natively, or on the English translation. Entities are detected with TurboEntityRecognizer, a named entity recognizer within TurboParser[4] (Martins et al., 2009). Then, we link the detected mentions to the knowledge base with a system based on our submission to TAC-KBP 2016 (Paikens et al., 2016).

[3] www.rethinkdb.com

2.6 Topic Recognition and Labeling

This module labels incoming news documents and transcripts with a fine-grained set of topic labels. The labels are learned from a multilingual corpus of nearly 600k documents in 8 of the 9 SUMMA languages (all except Latvian), which were manually annotated by journalists at Deutsche Welle. The document model is a hierarchical attention network with attention at each level of the hierarchy, inspired by Yang et al. (2016), followed by a sigmoid classification layer.
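To illustrate the shape of such a model, the following sketch applies attention pooling at two levels (words into sentence vectors, sentences into a document vector) and then an independent sigmoid per label. It is a simplified, untrained illustration and not the SUMMA model: the recurrent encoders and learned projections of a full hierarchical attention network are omitted, and all vectors, weights, and label names are made up.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(vectors, context):
    """Average vectors, weighted by softmax-normalised similarity to a context vector."""
    weights = softmax([dot(v, context) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def label_probabilities(document, word_context, sentence_context, label_weights):
    """document: list of sentences, each sentence a list of word vectors."""
    # Level 1: attention over the words of each sentence.
    sentence_vectors = [attention_pool(words, word_context) for words in document]
    # Level 2: attention over the sentences of the document.
    doc_vector = attention_pool(sentence_vectors, sentence_context)
    # One sigmoid per label, so several topics can be active at the same time.
    return {label: sigmoid(dot(w, doc_vector)) for label, w in label_weights.items()}


# Toy 2-dimensional example with invented values.
document = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
label_weights = {"politics": [2.0, 1.0], "economy": [1.0, 2.0], "sport": [-2.0, -2.0]}
probs = label_probabilities(document, [1.0, 0.0], [0.0, 1.0], label_weights)
print(probs)
```

The sigmoid output layer, unlike a softmax, does not force the label probabilities to sum to one, which suits a fine-grained label set where one document may carry several topic labels.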

2.7 Deep Semantic Tagging

The system also has a component that performs semantic parsing into Abstract Meaning Representations (Banarescu et al., 2013), with the aim of eventually incorporating them into the storyline generation. The parser was developed by Damonte et al. (2017).[5] It is an incremental left-to-right parser that builds an AMR graph structure using a neural network controller. It also includes adaptations to German, Spanish, Italian and Chinese.

[4] https://github.com/andre-martins/TurboParser

[5] Demo at http://cohort.inf.ed.ac.uk/amreager.html.

2.8 Knowledge Base Construction

This component provides a knowledge base of factual relations between entities, built with a model based on Universal Schemas (Riedel et al., 2013), a low-rank matrix factorization approach. The entity relations are extracted jointly across multiple languages, with entity pairs as rows and a set of structured relations and textual patterns as columns. The relations provide information about how various entities present in news documents are connected.
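The matrix layout can be made concrete with a toy example: entity pairs index the rows, textual patterns and structured relations index the columns, and each cell is scored by the sigmoid of the dot product of a row vector and a column vector. The sketch below is an assumption-laden simplification, not the SUMMA component: it trains with a plain logistic loss on explicitly labelled cells, whereas Riedel et al. (2013) use a ranking objective, and all entity pairs, patterns, and relation names are invented.

```python
import math
import random

# Toy data; every name here is invented for illustration.
ROWS = ["(Ada, Oxford)", "(Bob, Leeds)", "(Cy, Riga)", "(Di, Bonn)"]
COLS = ["X is a professor at Y", "employee_of", "X was born in Y", "place_of_birth"]

# Observed cells: 1 = fact seen, 0 = treated as negative. Missing cells are unknown.
OBSERVED = {
    (0, 0): 1, (0, 1): 1, (0, 2): 0, (0, 3): 0,
    (1, 0): 1,            (1, 2): 0, (1, 3): 0,   # cell (1, 1) is held out
    (2, 0): 0, (2, 1): 0, (2, 2): 1, (2, 3): 1,
    (3, 0): 0, (3, 1): 0, (3, 2): 1, (3, 3): 1,
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(row_vecs, col_vecs, r, c):
    """Probability-like score of cell (r, c) under the low-rank model."""
    return sigmoid(sum(a * b for a, b in zip(row_vecs[r], col_vecs[c])))


def train(rank=2, epochs=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    row_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in ROWS]
    col_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in COLS]
    for _ in range(epochs):
        for (r, c), y in OBSERVED.items():
            err = score(row_vecs, col_vecs, r, c) - y   # gradient of the logistic loss
            for k in range(rank):
                row_k, col_k = row_vecs[r][k], col_vecs[c][k]
                row_vecs[r][k] -= lr * err * col_k
                col_vecs[c][k] -= lr * err * row_k
    return row_vecs, col_vecs


row_vecs, col_vecs = train()
held_out = score(row_vecs, col_vecs, 1, 1)        # (Bob, Leeds) x employee_of, never seen
known_negative = score(row_vecs, col_vecs, 2, 1)  # (Cy, Riga) x employee_of, trained as 0
print(f"held-out cell: {held_out:.2f}, known negative: {known_negative:.2f}")
```

The held-out cell is never used in training; any score it receives comes from the row sharing a textual pattern with another entity pair for which the structured relation was observed. This is the mechanism by which textual patterns and structured relations inform each other in a shared matrix.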

2.9 Storyline Construction and Summarization

Storylines are constructed via online clustering, i.e., by assigning storyline identifiers to incoming documents in a streaming fashion, following the work in Aggarwal and Yu (2006). The resulting storylines are subsequently summarized via an extractive system based on Almeida and Martins (2013).
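A minimal sketch of streaming storyline assignment follows. It is not the method of Aggarwal and Yu (2006), whose framework maintains richer cluster summaries; it only shows the basic online pattern of comparing each incoming document with existing storylines and opening a new one when nothing is similar enough. The bag-of-words representation, the cosine measure, the threshold value, and the class name are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StorylineClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = []   # one accumulated term-count Counter per storyline

    def assign(self, text):
        """Return a storyline identifier for one incoming document."""
        vec = Counter(text.lower().split())
        best_id, best_sim = None, 0.0
        for storyline_id, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_id, best_sim = storyline_id, sim
        if best_id is None or best_sim < self.threshold:
            self.centroids.append(vec)           # open a new storyline
            return len(self.centroids) - 1
        self.centroids[best_id].update(vec)      # fold the document into the storyline
        return best_id


clusterer = StorylineClusterer()
stream = [
    "earthquake hits coastal city",
    "rescue teams reach earthquake city",
    "central bank raises interest rates",
    "bank rates decision surprises markets",
]
ids = [clusterer.assign(doc) for doc in stream]
print(ids)
```

Each document is processed once and never revisited, so the cost per document depends on the number of open storylines rather than on the number of documents seen so far.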

3 User Interface

Figure 2 shows the current web-based SUMMA Platform user interface in the storyline view. A storyline is a collection of news items that concern a particular “story” and how it develops over time. Details of the layout are explained in the figure annotations.

4 Future Work

The current version of the Platform is a prototype designed to demonstrate the orchestration and interaction of the individual processing components. The look and feel of the page may change significantly over the course of the project, in response to the needs and requirements and the feedback from the use case partners, the BBC and Deutsche Welle.

5 Availability

The public release of the SUMMA Platform as open source software is planned for April 2017.

6 Acknowledgments

This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688139.

References

Aggarwal, Charu C and Philip S Yu. 2006. “A framework for clustering massive text and categorical data streams.” SIAM Int’l. Conf. on Data Mining, 479–483.

Almeida, Miguel B and Andre FT Martins. 2013. “Fast and robust compressive summarization with dual decomposition and multi-task learning.” ACL, 196–206.

Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. “Abstract meaning representation for sembanking.” Linguistic Annotation Workshop.

Damonte, Marco, Shay B. Cohen, and Giorgio Satta. 2017. “An incremental parser for abstract meaning representation.” EACL.

Junczys-Dowmunt, Marcin, Tomasz Dwojak, and Hieu Hoang. 2016. “Is neural machine translation ready for deployment? A case study on 30 translation directions.” CoRR, abs/1610.01108.

Klejch, Ondřej, Ondřej Plátek, Lukáš Žilka, and Filip Jurčíček. 2015. “CloudASR: platform and service.” Int’l. Conf. on Text, Speech, and Dialogue, 334–341.

Martins, André FT, Noah A Smith, and Eric P Xing. 2009. “Concise integer linear programming formulations for dependency parsing.” ACL, 342–350.

Paikens, Peteris, Guntis Barzdins, Afonso Mendes, Daniel Ferreira, Samuel Broscheit, Mariana S. C. Almeida, Sebastião Miranda, David Nogueira, Pedro Balage, and André F. T. Martins. 2016. “SUMMA at TAC knowledge base population task 2016.” TAC. Gaithersburg, Maryland, USA.

Povey, Daniel, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. “The Kaldi speech recognition toolkit.” ASRU.

Riedel, Sebastian, Limin Yao, Benjamin M. Marlin, and Andrew McCallum. 2013. “Relation extraction with matrix factorization and universal schemas.” HLT-NAACL.

Sennrich, Rico, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. “Nematus: a toolkit for neural machine translation.” EACL Demonstration Session. Valencia, Spain.

Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. “Hierarchical attention networks for document classification.” NAACL. San Diego, CA, USA.
