Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project
📝 Abstract
The paper steps outside the comfort zone of traditional NLP tasks such as automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in automated multilingual news monitoring: segmentation of TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines. Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring. We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of sequence-to-sequence neural translation models. To enable joint multi-task learning for multilingual neural translation of morphologically rich languages, we replace the attention mechanism with a sliding-window mechanism and operate the sequence-to-sequence neural translation model on the character level rather than on the word level. The story segmentation and storyline clustering problems are tackled by examining the low-dimensional vectors produced as a by-product of the neural translation process. The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem.
📄 Content
Character-Level Neural Translation for Multilingual Media Monitoring
in the SUMMA Project
Guntis Barzdins, Steve Renals, Didzis Gosko
University of Latvia, University of Edinburgh, LETA
Riga 29 Rainis Blvd. IMCS UL, Edinburgh EH8 9AB, Riga 2 Marijas Str.
E-mail: guntis.barzdins@lu.lv, s.renals@ed.ac.uk, didzis.gosko@leta.lv
Abstract
The paper steps outside the comfort zone of traditional NLP tasks such as automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in automated multilingual news monitoring: segmentation of TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines. Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring. We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of sequence-to-sequence neural translation models. To enable joint multi-task learning for multilingual neural translation of morphologically rich languages, we replace the attention mechanism with a sliding-window mechanism and operate the sequence-to-sequence neural translation model on the character level rather than on the word level. The story segmentation and storyline clustering problems are tackled by examining the low-dimensional vectors produced as a by-product of the neural translation process. The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem.
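As a rough illustration of the sliding-window idea, a fixed-width window can be slid over the character sequence so that each decoding step sees only a local context, instead of attending over the whole input. The sketch below is not the authors' implementation; the window width, stride, and padding character are arbitrary assumptions for illustration.

```python
def char_windows(text, width=7, stride=1):
    """Slide a fixed-width window over a character sequence.

    In place of an attention mechanism, each decoding step would see
    only the characters inside the current window.
    """
    chars = list(text)
    # Pad both ends so the window is always full at the edges
    # (assumption: '_' as the padding character).
    pad = ["_"] * (width // 2)
    padded = pad + chars + pad
    # One window per input character position.
    return ["".join(padded[i:i + width])
            for i in range(0, len(chars), stride)]
```

For example, `char_windows("abc", width=3)` yields one centered window per character: `["_ab", "abc", "bc_"]`.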
Keywords: clustering, multilingual, translation
1. The SUMMA Project Overview
Media monitoring enables the global news media to be
viewed in terms of emerging trends, people in the news, and
the evolution of storylines (Risen et al., 2013). The massive
growth in the number of broadcast and Internet media
channels requires innovative ways to cope with this
increasing amount of data. It is the aim of the SUMMA 1
project to significantly improve media monitoring by
creating a platform to automate the analysis of media
streams across many languages.
Within the SUMMA project, three European news providers,
the broadcasters BBC and Deutsche Welle and the Latvian
news agency LETA, are joining forces with the University of
Edinburgh, University College London, the Swiss IDIAP
Research Institute, the Qatar Computing Research Institute,
and Priberam Labs from Portugal to adapt emerging big-data
neural deep learning NLP techniques to the needs of the
international news monitoring industry.
BBC Monitoring undertakes one of the most advanced, comprehensive, and large-scale media monitoring operations worldwide, providing news and information from media sources around the world. BBC Monitoring journalists and analysts translate from over 30 languages into English, and follow approximately 13,500 sources, of which 1,500 are television broadcasters, 1,300 are radio, 3,700 are key news portals worldwide, 20 are commercial news feeds, and the rest are RSS feeds and selected social media sources. Monitoring journalists follow important stories and flag breaking news events as part of routine monitoring.
The central idea behind SUMMA is to develop a scalable multilingual media monitoring platform (Fig. 1) that combines real-time media stream processing (speech recognition, machine translation, story clustering) with in-depth batch-oriented construction of a rich knowledge base of reported events and entities mentioned, enabling extractive summarization of the storylines in the news.
Figure 1: The components of the SUMMA project.
In this paper we focus only on the streaming shallow processing part of the SUMMA project (the dark block in Fig. 1), where recently developed neural machine translation techniques (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) enable a radically new end-to-end approach to machine translation and clustering of the incoming news stories. The approach is informed by our previous work on machine learning (Barzdins, Paikens & Gosko, 2013), media monitoring (Barzdins et al., 2014), and character-level neural translation (Barzdins & Gosko, 2016).
1 SUMMA (Scalable Understanding of Multilingual MediA) is project 688139 funded by the European Union H2020-ICT-16-BigData research call. The project started in February 2016 and will last 3 years.
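The downstream use of the translation model's low-dimensional vectors can be sketched as follows: a story boundary is hypothesized where consecutive sentence vectors become dissimilar, and stories are grouped into storylines by similarity to storyline centroids. The vectors and the similarity threshold below are toy stand-ins, not outputs of the actual model, and the greedy clustering scheme is an illustrative assumption rather than the paper's algorithm.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def segment(sentence_vectors, threshold=0.5):
    """Place a story boundary wherever consecutive sentence
    vectors drop below the similarity threshold."""
    boundaries = []
    for i in range(1, len(sentence_vectors)):
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold:
            boundaries.append(i)
    return boundaries

def cluster(story_vectors, threshold=0.5):
    """Greedy storyline clustering: attach each story to the first
    storyline whose centroid is similar enough, else start a new one."""
    storylines = []  # each entry is a list of story indices
    centroids = []
    for idx, v in enumerate(story_vectors):
        for k, c in enumerate(centroids):
            if cosine(v, c) >= threshold:
                storylines[k].append(idx)
                members = np.array([story_vectors[j] for j in storylines[k]])
                centroids[k] = members.mean(axis=0)
                break
        else:
            storylines.append([idx])
            centroids.append(np.array(v, dtype=float))
    return storylines
```

With three toy story vectors `[[1, 0], [0.9, 0.1], [0, 1]]`, `segment` places a boundary before the third vector and `cluster` groups the first two stories into one storyline, leaving the third on its own.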
2. Multilingual Neural Translation
Automation of media monitoring tasks has been the focus of a number of earlier projects such as European Media Monitor (emm.newsbrief.eu), EventRegistry (eventregistry.org), xLike (xlike.org), Bison (bison-project.eu), NewsReader (newsreader-project.eu), MultiSensor (multisensorproject.eu), inEvent (invent-project.eu), and the xLiMe project (xlime.eu). These predecessor projects are dominated by the p