The State of Speech in HCI: Trends, Themes and Challenges

Reading time: 5 minutes
...

📝 Original Info

  • Title: The State of Speech in HCI: Trends, Themes and Challenges
  • ArXiv ID: 1810.06828
  • Date: 2018-03-15
  • Authors: Cohen, Cheyer, Horvitz, El Kaliouby, Whittaker, Alm, Todman, Elder, Newell, et al.

📝 Abstract

Speech interfaces are growing in popularity. Through a review of 68 research papers this work maps the trends, themes, findings and methods of empirical research on speech interfaces in HCI. We find that most studies are usability/theory-focused or explore wider system experiences, evaluating Wizard of Oz, prototypes, or developed systems by using self-report questionnaires to measure concepts like usability and user attitudes. A thematic analysis of the research found that speech HCI work focuses on nine key topics: system speech production, modality comparison, user speech production, assistive technology & accessibility, design insight, experiences with interactive voice response (IVR) systems, using speech technology for development, people's experiences with intelligent personal assistants (IPAs) and how user memory affects speech interface interaction. From these insights we identify gaps and challenges in speech research, notably the need to develop theories of speech interface interaction, grow critical mass in this domain, increase design work, and expand research from single to multiple user interaction contexts so as to reflect current use contexts. We also highlight the need to improve measure reliability, validity and consistency, to increase in-the-wild deployment, and to reduce barriers to building fully functional speech interfaces for research.


📄 Full Content

Speech has become a more prominent way of interacting with automatic systems. In addition to long-established telephony-based or interactive voice response (IVR) interfaces, voice-enabled intelligent personal assistants (IPAs) like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana are widely available on a number of devices. Home-based devices such as Amazon Echo, Apple HomePod, and Google Home increasingly use speech as the primary form of interaction. The market for IPAs alone is projected to reach $4.61 billion by the early 2020s (Kamitis, 2016). The technical infrastructures underpinning speech interfaces have advanced rapidly in recent years and are the subject of extensive research in the speech technology community (Chan, Jaitly, Le, & Vinyals, 2016).

A common example of speech interfaces are spoken dialogue systems (SDS), where the system and user interact through spoken natural language. The dialogue involved can range from highly formulaic question-answer pairs, where the system keeps the initiative and users respond, through HMIHY ('How may I help you?') systems, where the user can formulate wider queries and the system computes the optimal next move, to systems that appear to allow user initiative by permitting users to interrupt ('barge in') and change task. However, at present, system dialogues are quite stiff, and just how closely they simulate human conversation is open to debate.

SDSs generally follow a common pipeline design to engage in dialogue with users. They first detect that a user is addressing the system; this can include identifying which of several possible users is addressing it (speaker diarization). The system's automatic speech recognition (ASR) component then recognizes what is said and passes the recognition results to the natural language understanding (NLU) component. The function of the NLU is to identify the intent behind the user's utterance and to express it in machine-understandable form, often as a dialogue act. The dialogue manager (DM) then selects an appropriate action to take based on the identified intent, considering factors such as the current state of the dialogue and recent dialogue history. A natural language generation (NLG) component then generates a natural response, which the system outputs as artificial speech using text-to-speech synthesis (TTS) (Jokinen & McTear, 2010; Lison & Kennington, 2016).

Some speech interfaces are used in conjunction with other input/output modalities to facilitate interaction (Weinschenk & Barker, 2000). For example, speech can be used as either input or output alongside graphical user interfaces (GUIs), as is common in speech dictation for word-processing tasks or in screen-reading technology that supports the navigation of websites. Indeed, SDSs such as IPAs (e.g. Siri) often display content and request user input through the screen as well as through speech (Cowan et al., 2017).
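
To make the pipeline stages concrete, the sketch below walks one user turn through the ASR, NLU, DM, NLG and TTS steps described above. It is a minimal illustration only: every component is a toy stand-in with hypothetical names, not part of any system discussed in the paper, and a real SDS would plug trained models into each stage.

```python
# Minimal sketch of the SDS pipeline described above (illustrative toy stand-ins only).
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Dialogue state consulted by the dialogue manager (DM), including recent history."""
    history: list = field(default_factory=list)


def asr(audio: bytes) -> str:
    """Automatic speech recognition: audio in, recognized text out (toy stand-in)."""
    return audio.decode("utf-8")  # pretend the 'audio' is already transcribed text


def nlu(text: str) -> dict:
    """Natural language understanding: map the utterance to a machine-readable dialogue act."""
    if "weather" in text.lower():
        return {"intent": "ask_weather"}
    return {"intent": "unknown", "utterance": text}


def dialogue_manager(act: dict, state: DialogueState) -> dict:
    """Select the next system action from the intent plus dialogue state and history."""
    state.history.append(act)
    if act["intent"] == "ask_weather":
        return {"action": "inform_weather"}
    return {"action": "clarify"}


def nlg(action: dict) -> str:
    """Natural language generation: turn the chosen action into a response string."""
    responses = {
        "inform_weather": "It looks sunny today.",
        "clarify": "Sorry, could you rephrase that?",
    }
    return responses[action["action"]]


def tts(text: str) -> bytes:
    """Text-to-speech synthesis: response text to audio (toy stand-in returns bytes)."""
    return text.encode("utf-8")


def handle_turn(audio: bytes, state: DialogueState) -> bytes:
    """One user turn through the pipeline: ASR -> NLU -> DM -> NLG -> TTS."""
    return tts(nlg(dialogue_manager(nlu(asr(audio)), state)))


state = DialogueState()
print(handle_turn(b"what's the weather like?", state))  # b'It looks sunny today.'
```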

While interest in speech interfaces has been growing steadily (Cohen, Cheyer, Horvitz, El Kaliouby, & Whittaker, 2016; Munteanu et al., 2017; Munteanu & Penn, 2014), there is no clear idea of what forms the core of speech-based work in HCI. This makes it difficult to identify novel areas of research and the challenges faced in the HCI field, particularly for those new to the topic. As speech gains in popularity as an interface modality, it is important that the state of speech research published in the HCI community is clearly mapped, so that those who come to the research have a clear idea of the major trends, topics and methods. The current paper aims to achieve this by reviewing empirical work published across a range of leading HCI venues. We also hope this may guide and inform future endeavours in the field by identifying opportunities for further research. Below we report the method used to conduct the review and our findings, and discuss challenges for future research efforts based on our results.

We reviewed 68 publications on user interactions with speech as either a system output (e.g. Alm, Todman, Elder, & Newell, 1993), user input (e.g. Harada, Wobbrock, Malkin, Bilmes, & Landay, 2009) or in a dialogue context (e.g. Cowan, Branigan, Obregón, Bugis, & Beale, 2015; Porcheron, Fischer, & Sharples, 2017). Papers were selected using adapted PRISMA guidelines, similar to the adapted QUORUM procedures in previous reviews (Bargas-Avila & Hornbaek, 2011; Mekler, Bopp, Tuch, & Opwis, 2014).

Three databases were searched for relevant publications in January 2018: ACM Digital Library (ACM DL), ProQuest (PQ) and Scopus (SP). Each database was searched using terms generated from keywords in existing speech literature, and from a survey of 11 leading researchers in speech HCI and speech technology (see Table 1). The terms were searched as exact phrases and, where possible, combined using Boolean operators (e.g. “OR”); otherwise, terms were searched for individually. Searches were limited to terms appearing in the title, abstract, and publication keywords, and were also limited to journal articles and conference proceedings.
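
As a rough illustration of how such exact-phrase Boolean queries might be assembled, the snippet below combines a handful of example terms. The terms shown are hypothetical stand-ins; the review's actual search terms are listed in Table 1 of the paper and are not reproduced here.

```python
# Illustrative only: build an exact-phrase query joined with the Boolean OR operator.
# These terms are hypothetical examples, not the review's actual Table 1 terms.
terms = [
    "speech interface",
    "spoken dialogue system",
    "intelligent personal assistant",
    "interactive voice response",
]

# Quote each term so it is searched as an exact phrase, then join with OR.
query = " OR ".join(f'"{term}"' for term in terms)
print(query)
# "speech interface" OR "spoken dialogue system" OR "intelligent personal assistant" OR "interactive voice response"
```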


Reference

This content is AI-processed based on open access ArXiv data.
