Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project

📝 Abstract

The paper steps outside the comfort zone of traditional NLP tasks such as automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in automated multilingual news monitoring: segmentation of TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines. Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring. We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of sequence-to-sequence neural translation models. To enable joint multi-task learning for multilingual neural translation of morphologically rich languages, we replace the attention mechanism with a sliding-window mechanism and operate the sequence-to-sequence neural translation model on the character level rather than on the word level. The story segmentation and storyline clustering problems are tackled by examining the low-dimensional vectors produced as a side product of the neural translation process. The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem.
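To make the representation idea concrete, the following is a minimal sketch (not the SUMMA implementation, which uses a full sequence-to-sequence translator with a sliding-window mechanism): a character-level recurrent encoder compresses an arbitrary text fragment into one fixed low-dimensional vector, the kind of side-product representation the abstract refers to. The alphabet, hidden size, and class names here are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical character inventory; unknown characters map to space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,"
CHAR_IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    """One-hot character vector over the illustrative alphabet."""
    v = np.zeros(len(ALPHABET))
    v[CHAR_IDX.get(ch, CHAR_IDX[" "])] = 1.0
    return v

class CharEncoder:
    """Plain tanh RNN over characters; the final hidden state is the
    low-dimensional representation of the whole input string."""

    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        n = len(ALPHABET)
        self.Wxh = rng.normal(0.0, 0.1, (hidden, n))   # input weights
        self.Whh = rng.normal(0.0, 0.1, (hidden, hidden))  # recurrent weights
        self.b = np.zeros(hidden)

    def encode(self, text):
        h = np.zeros(self.b.shape)
        for ch in text.lower():
            h = np.tanh(self.Wxh @ one_hot(ch) + self.Whh @ h + self.b)
        return h

enc = CharEncoder()
v = enc.encode("flood hits riga")
print(v.shape)  # (16,)
```

In the paper's setting the encoder would be trained as part of a translation model, so that semantically similar stories, even across languages, end up with nearby vectors; the untrained weights above only illustrate the data flow.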

📄 Content

Character-Level Neural Translation for Multilingual Media Monitoring
in the SUMMA Project

Guntis Barzdins (University of Latvia, IMCS UL, 29 Rainis Blvd., Riga), Steve Renals (University of Edinburgh, Edinburgh EH8 9AB), Didzis Gosko (LETA, 2 Marijas Str., Riga)
E-mail: guntis.barzdins@lu.lv, s.renals@ed.ac.uk, didzis.gosko@leta.lv

Abstract

The paper steps outside the comfort zone of traditional NLP tasks such as automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in automated multilingual news monitoring: segmentation of TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines. Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring. We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of sequence-to-sequence neural translation models. To enable joint multi-task learning for multilingual neural translation of morphologically rich languages, we replace the attention mechanism with a sliding-window mechanism and operate the sequence-to-sequence neural translation model on the character level rather than on the word level. The story segmentation and storyline clustering problems are tackled by examining the low-dimensional vectors produced as a side product of the neural translation process. The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem.

Keywords: clustering, multilingual, translation

1. The SUMMA Project Overview

Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of storylines (Risen et al., 2013). The massive growth in the number of broadcast and Internet media channels requires innovative ways to cope with this increasing amount of data. It is the aim of the SUMMA project 1 to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages. Within the SUMMA project, three European news organisations, the BBC, Deutsche Welle, and the Latvian news agency LETA, are joining forces with the University of Edinburgh, University College London, the Swiss IDIAP Research Institute, the Qatar Computing Research Institute, and Priberam Labs from Portugal to adapt emerging big-data neural deep learning NLP techniques to the needs of the international news monitoring industry.
BBC Monitoring undertakes one of the most advanced, comprehensive, and large-scale media monitoring operations worldwide, providing news and information from media sources around the world. BBC Monitoring journalists and analysts translate from over 30 languages into English, and follow approximately 13,500 sources, of which 1,500 are television broadcasters, 1,300 are radio stations, 3,700 are key news portals worldwide, 20 are commercial news feeds, and the rest are RSS feeds and selected social media sources. Monitoring journalists follow important stories and flag breaking news events as part of routine monitoring.

The central idea behind SUMMA is to develop a scalable multilingual media monitoring platform (Fig. 1) that combines real-time media stream processing (speech recognition, machine translation, story clustering) with in-depth batch-oriented construction of a rich knowledge base of reported events and entities mentioned, enabling extractive summarization of the storylines in the news.

1 SUMMA (Scalable Understanding of Multilingual MediA) is project 688139 funded by the European Union H2020-ICT-16 BigData-research call. The project started in February 2016 and will last 3 years.

Figure 1: The components of the SUMMA project.

In this paper we focus only on the streaming shallow processing part of the SUMMA project (the dark block in Fig. 1), where the recently developed neural machine translation techniques (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) enable a radically new end-to-end approach to machine translation and clustering of the incoming news stories. The approach is informed by our previous work on machine learning (Barzdins, Paikens & Gosko, 2013), media monitoring (Barzdins et al., 2014), and character-level neural translation (Barzdins & Gosko, 2016).
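The two downstream tasks driven by such story vectors can be sketched as simple post-processing, under illustrative assumptions (threshold values and the greedy clustering scheme below are hypothetical, not taken from the paper): a story boundary is hypothesised wherever the cosine similarity between vectors of adjacent transcript windows drops, and stories are then grouped into storylines by similarity to running cluster centroids.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment(window_vecs, boundary_thresh=0.5):
    """Split a sequence of window vectors into stories at the points
    where adjacent-window similarity falls below the threshold.
    Returns lists of window indices, one list per story."""
    stories, current = [], [0]
    for i in range(1, len(window_vecs)):
        if cosine(window_vecs[i - 1], window_vecs[i]) < boundary_thresh:
            stories.append(current)
            current = []
        current.append(i)
    stories.append(current)
    return stories

def cluster_storylines(story_vecs, link_thresh=0.8):
    """Greedy single-pass clustering: attach each story to the most
    similar centroid above the threshold, else open a new storyline."""
    centroids, members = [], []
    for i, v in enumerate(story_vecs):
        best, best_sim = None, link_thresh
        for j, c in enumerate(centroids):
            s = cosine(v, c)
            if s >= best_sim:
                best, best_sim = j, s
        if best is None:
            centroids.append(np.array(v, dtype=float))
            members.append([i])
        else:
            members[best].append(i)
            centroids[best] = np.mean([story_vecs[k] for k in members[best]], axis=0)
    return members

# Toy vectors: two related windows, then one unrelated window.
vecs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
print(segment(vecs))             # → [[0, 1], [2]]
print(cluster_storylines(vecs))  # → [[0, 1], [2]]
```

Because the vectors come from a shared multilingual model, the same similarity test can in principle link stories across languages; in practice one would tune the thresholds on held-out annotated transcripts rather than fix them as here.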
2. Multilingual Neural Translation

Automation of media monitoring tasks has been the focus of a number of earlier projects, such as European Media Monitor (emm.newsbrief.eu), EventRegistry (eventregistry.org), xLike (xlike.org), Bison (bison-project.eu), NewsReader (newsreader-project.eu), MultiSensor (multisensorproject.eu), inEvent (invent-project.eu), and the xLiMe project (xlime.eu). These predecessor projects are dominated by the p
